apache
Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages
Yang, Ivory, Ma, Weicheng, Zhang, Chunhui, Vosoughi, Soroush
Endangered languages, such as Navajo - the most widely spoken Native American language - are significantly underrepresented in contemporary language technologies, exacerbating the challenges of their preservation and revitalization. This study evaluates Google's Language Identification (LangID) tool, which does not currently support any Native American languages. To address this, we introduce a random forest classifier trained on Navajo and twenty erroneously suggested languages by LangID. Despite its simplicity, the classifier achieves near-perfect accuracy (97-100%). Additionally, the model demonstrates robustness across other Athabaskan languages - a family of Native American languages spoken primarily in Alaska, the Pacific Northwest, and parts of the Southwestern United States - suggesting its potential for broader application. Our findings underscore the pressing need for NLP systems that prioritize linguistic diversity and adaptability over centralized, one-size-fits-all solutions, especially in supporting underrepresented languages in a multicultural world. This work directly contributes to ongoing efforts to address cultural biases in language models and advocates for the development of culturally localized NLP tools that serve diverse linguistic communities.
CybORG++: An Enhanced Gym for the Development of Autonomous Cyber Agents
Emerson, Harry, Bates, Liz, Hicks, Chris, Mavroudis, Vasilios
CybORG++ is an advanced toolkit for reinforcement learning research focused on network defence. Building on the CAGE 2 CybORG environment, it introduces key improvements, including enhanced debugging capabilities, refined agent implementation support, and a streamlined environment that enables faster training and easier customisation. Along with addressing several software bugs from its predecessor, CybORG++ introduces MiniCAGE, a lightweight version of CAGE 2, which improves performance dramatically, up to 1000x faster execution in parallel iterations, without sacrificing accuracy or core functionality. CybORG++ serves as a robust platform for developing and evaluating defensive agents, making it a valuable resource for advancing enterprise network defence research.
Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Eiras, Francisco, Petrov, Aleksandar, Vidgen, Bertie, de Witt, Christian Schroeder, Pizzati, Fabio, Elkins, Katherine, Mukhopadhyay, Supratik, Bibi, Adel, Csaba, Botos, Steibel, Fabro, Barez, Fazl, Smith, Genevieve, Guadagni, Gianluca, Chun, Jon, Cabot, Jordi, Imperial, Joseph Marvin, Nolazco-Flores, Juan A., Landay, Lori, Jackson, Matthew, Röttger, Paul, Torr, Philip H. S., Darrell, Trevor, Lee, Yong Suk, Foerster, Jakob
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation is likely to put at risk the budding field of open-source Generative AI. We argue for the responsible open sourcing of generative AI models in the near and medium term. To set the stage, we first introduce an AI openness taxonomy system and apply it to 40 current large language models. We then outline differential benefits and risks of open versus closed source AI and present potential risk mitigation, ranging from best practices to calls for technical and scientific contributions. We hope that this report will add a much needed missing voice to the current public discourse on near to mid-term AI safety and other societal impact.
Pub/Sub Message Brokers for GenAI
Saleh, Alaa, Pirttikangas, Susanna, Lovén, Lauri
In today's digital world, Generative Artificial Intelligence (GenAI) such as Large Language Models (LLMs) is becoming increasingly prevalent, extending its reach across diverse applications. This surge in adoption has sparked a significant increase in demand for data-centric GenAI models, highlighting the necessity for robust data communication infrastructures. Central to this need are message brokers, which serve as essential channels for data transfer within various system components. This survey aims to delve into a comprehensive analysis of traditional and modern message brokers, offering a comparative study of prevalent platforms. Our study considers numerous criteria including, but not limited to, open-source availability, integrated monitoring tools, message prioritization mechanisms, capabilities for parallel processing, reliability, distribution and clustering functionalities, authentication processes, data persistence strategies, fault tolerance, and scalability. Furthermore, we explore the intrinsic constraints that the design and operation of each message broker might impose, recognizing that these limitations are crucial in understanding their real-world applicability. We then leverage these insights to propose a sophisticated message broker framework -- one designed with the adaptability and robustness necessary to meet the evolving requisites of GenAI applications. Finally, this study examines the enhancement of message broker mechanisms specifically for GenAI contexts, emphasizing the criticality of developing a versatile message broker framework. Such a framework would be poised for quick adaptation, catering to the dynamic and growing demands of GenAI in the foreseeable future. Through this dual-pronged approach, we intend to contribute a foundational compendium that can guide future innovations and infrastructural advancements in the realm of GenAI data communication.
Apache Submarine: A Unified Machine Learning Platform Made Simple
Chen, Kai-Hsun, Su, Huan-Ping, Chuang, Wei-Chiu, Hsiao, Hung-Chang, Tan, Wangda, Tang, Zhankun, Liu, Xun, Liang, Yanbo, Lo, Wen-Chih, Ji, Wanqiang, Hsu, Byron, Hu, Keqiu, Jian, HuiYang, Zhou, Quan, Wang, Chien-Min
As machine learning is applied more widely, it is necessary to have a machine learning platform for both infrastructure administrators and users including expert data scientists and citizen data scientists to improve their productivity. However, existing machine learning platforms are ill-equipped to address the "Machine Learning tech debts" such as glue code, reproducibility, and portability. Furthermore, existing platforms only take expert data scientists into consideration, and thus they are inflexible for infrastructure administrators and non-user-friendly for citizen data scientists. We propose Submarine, a unified machine learning platform, to address the challenges.
AI Fueling The Oil And Gas Industry: Interview With Tim Custer At Apache
In industries where data is key to gaining competitive advantage, artificial intelligence and machine learning have become necessities. This is most definitely the case in the oil and gas industries, which ebb and flow over time as market demand waxes and wanes for critical resources we've come to depend on. After serving as land manager for the past ten years, Custer has shared how tied to real estate and traditional non-energy businesses the oil and gas sector is, and the role that machine learning and AI are playing in greatly changing the way the energy industry deals with documents. According to Custer, AI and machine learning are extracting valuable information from unstructured data. The oil and gas industry is particularly dependent on an intricate set of processes and document-centric needs for land leases.
Rule As a Code -- SureLog Correlation Engine and Beyond
SureLog SIEM is a security platform that differs from many SIEM products. The main difference is its correlation engine, in which you can develop your own logic with a high-level domain-specific language. There is no restriction on the logic, because you can develop it in Java, including machine learning, statistical methods and artificial intelligence. SureLog is also ready for the following ML libraries. SureLog's correlation engine has a feature called Rule As a Code (Rule Code).
Exercise Forging Sabre: Apache, fighter pilots get enemy data faster with help of AI
BOISE, Idaho: Soaring silently in the sky, the Heron 1 unmanned aerial vehicle (UAV) spots three moving vehicles below suspected to be enemy targets. The UAV feeds real-time video back to a big screen in the command post. Commanders there immediately see red rectangles appear around the vehicles. This is the Automatic Target Detection (ATD) system confirming they are threats. Three F-16 fighter jets are scrambled.
Import2vec - Learning Embeddings for Software Libraries
Theeten, Bart, Vandeputte, Frederik, Van Cutsem, Tom
We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning. We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages ("library vectors"). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveal that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).
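As a rough illustration of the "context of use as determined by import statements" idea, the sketch below collects the libraries imported by each source file and counts how often two libraries appear in the same file. It is not the authors' pipeline: the function names, the toy corpus, and the use of plain co-occurrence counts (rather than a trained word embedding model) are assumptions made for illustration, and it handles Python sources only.

```python
import ast
from collections import Counter
from itertools import combinations


def imported_libraries(source):
    """Return the set of top-level package names imported by a Python source string."""
    libs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                libs.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z" only; relative imports are skipped
            # because they refer to the project itself, not to a library.
            libs.add(node.module.split(".")[0])
    return libs


def cooccurrence_counts(sources):
    """Count how many files import each unordered pair of libraries together."""
    counts = Counter()
    for src in sources:
        libs = sorted(imported_libraries(src))
        counts.update(combinations(libs, 2))
    return counts


# Hypothetical toy corpus: each string stands in for one source file.
corpus = [
    "import numpy as np\nimport pandas as pd\n",
    "import numpy\nfrom pandas import DataFrame\nimport os.path\n",
    "from . import sibling\nimport os\n",
]

pair_counts = cooccurrence_counts(corpus)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

Pairs that co-occur in many files share a context of use; an embedding model would be trained on this kind of (library, context) data so that such libraries end up close together in vector space.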
US Army starts work on future attack-recon helicopter
The Army is now crafting early requirements for what is expected to be a new attack helicopter -- beyond the Apache -- with superior weapons, speed, maneuverability, sensor technology and vastly-improved close-combat attack capability. "We know that in the future we are going to need to have a lethal capability, which drives us to a future attack reconnaissance platform. The Apache is the world's greatest but there will come a time when we look at leap ahead technology," Army Vice Chief of Staff Gen. James McConville told a small group of reporters. A future attack-reconnaissance helicopter, now in its conceptual phase, is a key part of a wide-spanning, multi-aircraft Army Future Vertical Lift (FVL) program. FVL seeks a family of next-generation aircraft to begin emerging in the 2030s, consisting of attack, utility and heavy-class air assets.