data service


Data Virtualization for Machine Learning

Khan, Saiful, Chakraborty, Joyraj, Beaucamp, Philip, Bhujel, Niraj, Chen, Min

arXiv.org Artificial Intelligence

Nowadays, machine learning (ML) teams have multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities and commonly takes months, and sometimes years, from initial data wrangling to model deployment. Organizationally, there is a large amount of intermediate data to be stored, processed, and maintained. Data virtualization becomes a critical technology in an infrastructure to serve ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.
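The core idea of the abstract above can be sketched in a few lines: workflows refer to datasets by logical name and version, and the virtualization service resolves each reference to a physical location. This is a minimal illustration, not the paper's implementation; the class, dataset names, and URIs are hypothetical.

```python
# Minimal sketch of a data-virtualization layer for ML workflows.
# Workflows address data logically; the service owns the physical mapping.

class DataVirtualizationService:
    def __init__(self):
        self._catalog = {}  # (name, version) -> physical URI

    def register(self, name, version, uri):
        """Record where a dataset (or intermediate result) physically lives."""
        self._catalog[(name, version)] = uri

    def resolve(self, name, version="latest"):
        """Return the physical URI for a logical dataset reference."""
        if version == "latest":
            versions = [v for (n, v) in self._catalog if n == name]
            if not versions:
                raise KeyError(name)
            version = max(versions)
        return self._catalog[(name, version)]

svc = DataVirtualizationService()
svc.register("traffic/frames", 1, "s3://bucket/raw/frames-v1")
svc.register("traffic/frames", 2, "s3://bucket/raw/frames-v2")
latest = svc.resolve("traffic/frames")  # resolves to the v2 URI
```

Because workflows only hold logical names, intermediate data can be moved, versioned, or garbage-collected behind the service without touching workflow code.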


A2CI: A Cloud-based, Service-oriented Geospatial Cyberinfrastructure to Support Atmospheric Research

Li, Wenwen, Shao, Hu, Wang, Sizhe, Zhou, Xiran, Wu, Sheng

arXiv.org Artificial Intelligence

In recent years, atmospheric research has received increasing attention from environmental experts and the public because atmospheric phenomena such as El Niño, global warming, ozone depletion, and drought that may have negative effects on the Earth's climate and ecosystem are occurring more often (Walther et al., 2002; Karl and Trenberth, 2003; Trenberth et al., 2014). In order to model the status quo and predict the trend of atmospheric phenomena and events, researchers need to retrieve data from various relevant domains, such as chemical components of aerosols and gases, the terrestrial surface, energy consumption, the hydrosphere, the biosphere, etc. (Schneider, 2006; Fowler et al., 2009; Guilyardi et al., 2009; Ramanathan et al., 2011; Katul et al., 2012). In complex earth system modeling, the data and services for atmospheric study present the characteristics of being distributed, collaborative and adaptive (Plale et al., 2006). The massive volume, rapid velocity and wide variety of data have led to a new era of atmospheric research that consists of accessing and integrating big data from distributed sources, conducting collaborative analysis in an interactive way, and providing intelligent services for data management, integration and visualization to foster discovery of hidden or new knowledge. One of the most important ways to support these activities is to establish a national or international spatial data infrastructure and geospatial cyberinfrastructure on which data and computational resources can be easily shared, spatial analysis tools can be executed on-the-fly, and scientific results can be effectively visualized (Yang et al., 2008; Li et al., 2011).
Technically, a geospatial cyberinfrastructure (GCI) is an architecture that effectively utilizes geo-referenced data to connect people, information and computers based on standardized data access protocols, high-speed internet, high-performance computing (HPC) facilities and service-oriented data management (Yang et al., 2010). Since the concept's official introduction by the National Science Foundation (NSF) in its 2003 blue-ribbon report, cyberinfrastructure research has attracted much attention from the atmospheric science domain because of its promise of bringing paradigm change for
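The "standardized data access protocols" a GCI builds on are typically OGC web services. As a hedged illustration, the sketch below composes an OGC WMS 1.3.0 GetMap request for a georeferenced raster layer; the endpoint and layer name are hypothetical, but the query parameters follow the WMS 1.3.0 specification.

```python
# Sketch of standardized, on-the-fly geospatial data access via OGC WMS.
from urllib.parse import urlencode

def wms_getmap_url(endpoint, layer, bbox, width=512, height=512):
    """Build a WMS 1.3.0 GetMap URL for a georeferenced raster layer.

    bbox is (min_lat, min_lon, max_lat, max_lon) -- WMS 1.3.0 uses
    latitude-first axis order for EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)

url = wms_getmap_url("https://example.org/geoserver/wms",
                     "aerosol_optical_depth", (10, -120, 50, -60))
```

Because every WMS-compliant server accepts the same parameters, the same client code can pull layers from many distributed providers, which is exactly the sharing the infrastructure aims for.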


tf.data service: A Case for Disaggregating ML Input Data Processing

Audibert, Andrew, Chen, Yang, Graur, Dan, Klimovic, Ana, Simsa, Jiri, Thekkath, Chandramohan A.

arXiv.org Artificial Intelligence

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. The host CPU and RAM required, per accelerator core, to avoid data stalls varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to under-utilizing either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can scale out horizontally to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32x and cost by 26x, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
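The disaggregation idea can be shown without TensorFlow or gRPC: preprocessing workers run independently of the trainer, and their count is a per-job knob rather than a fixed property of the accelerator host. This is a toy, framework-free sketch of the shape of the system, not the tf.data service implementation.

```python
# Toy sketch of disaggregated input processing: worker count is chosen
# per job instead of being fixed by the accelerator host's hardware ratio.
import queue
import threading

def preprocess(sample):
    return sample * 2  # stand-in for decode/augment/transform work

def run_job(samples, num_workers):
    todo, done = queue.Queue(), queue.Queue()
    for s in samples:
        todo.put(s)

    def worker():
        while True:
            try:
                s = todo.get_nowait()
            except queue.Empty:
                return  # no more input to process
            done.put(preprocess(s))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The "trainer" consumes preprocessed elements as they arrive.
    return sorted(done.get() for _ in range(len(samples)))

batch = run_job(list(range(8)), num_workers=4)  # right-size workers per job
```

In TensorFlow itself, the analogous decoupling is expressed by applying `tf.data.experimental.service.distribute(...)` to a dataset and pointing it at a dispatcher address, with worker processes scaled independently of the trainer.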


Bi-directional personalization reinforcement learning-based architecture with active learning using a multi-model data service for the travel nursing industry

Beyenne, Ezana N.

arXiv.org Artificial Intelligence

The challenges of inadequate online recruitment systems can be addressed with machine learning and software engineering techniques. A bi-directional personalization, reinforcement learning-based architecture with active learning helps recruiters find qualified applicants while also giving applicants personalized job recommendations. This paper focuses on how machine learning techniques can enhance the recruitment process in the travel nursing industry: first speeding up data acquisition with a multi-model data service, then providing personalized recommendations using bi-directional reinforcement learning with active learning. This need became especially evident during the COVID-19 pandemic, when lockdowns left healthcare facilities with overwhelming demand for traveling nurses and other healthcare professionals. A data service was architected for job-feed processing using an orchestration of natural language processing (NLP) models that synthesize job-related data into a database efficiently and accurately. The multi-model data service provided the data necessary to develop a bi-directional personalization system, based on reinforcement learning with active learning, that recommends travel nurses and healthcare professionals to recruiters and jobs to applicants, using an internally developed smart match score as its basis. The architecture combines two personalization systems: one that runs forward to recommend qualified candidates for jobs, and another that runs backward to recommend jobs for applicants.
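The bi-directional idea can be illustrated with a single match-score table driving both directions: ranking candidates for a job (forward) and ranking jobs for a candidate (backward). The scores and names below are illustrative stand-ins, not the paper's internally developed smart match score.

```python
# Hedged sketch: one score matrix, two recommendation directions.

scores = {  # (candidate, job) -> match score in [0, 1], hypothetical values
    ("nurse_a", "icu_tx"): 0.91,
    ("nurse_a", "er_ca"): 0.40,
    ("nurse_b", "icu_tx"): 0.65,
    ("nurse_b", "er_ca"): 0.88,
}

def candidates_for_job(job, k=2):
    """Forward direction: recommend top-k candidates to a recruiter."""
    ranked = sorted(((c, s) for (c, j), s in scores.items() if j == job),
                    key=lambda x: -x[1])
    return [c for c, _ in ranked[:k]]

def jobs_for_candidate(candidate, k=2):
    """Backward direction: recommend top-k jobs to an applicant."""
    ranked = sorted(((j, s) for (c, j), s in scores.items() if c == candidate),
                    key=lambda x: -x[1])
    return [j for j, _ in ranked[:k]]
```

In the paper's setting, reinforcement learning with active learning would update these scores from recruiter and applicant feedback; here they are static for clarity.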


Data Engineer at Enova - Chicago, IL

#artificialintelligence

Enova is currently accepting candidates for remote positions in the following eligible states: AL, AK, AR, AZ, CT, GA, IA, ID, IL, IN, KY, LA, MA, ME, MD, MI, MN, MO, MS, NC, ND, NE, NH, NV, NJ, NM, OH, OK, OR, PA, RI, SC, SD, TN, UT, VT, WI, WV, WY. The Data Services Team effectively and sustainably builds data strategy and provides data solutions and tools across the organization. We integrate, transform, and improve volumes of data at the project or enterprise level for streamlined processes, greater efficiencies, and smarter, more informed decision-making. This team is high-energy, dynamic and in a business-critical domain space. This role is an opportunity to make a real difference in the data space, and we need confident, experienced people eager to bring in solutions, with a demonstrated ability to learn fast and make that happen.


Why composability is key to scaling digital twins

#artificialintelligence

Digital twins enable enterprises to model and simulate buildings, products, manufacturing lines, facilities and processes. This can improve performance, quickly flag quality errors and support better decision-making. Today, most digital twin projects are one-off efforts.


Overcoming ML Data Preprocessing Bottlenecks With gRPC

#artificialintelligence

One of the measures of the health of a deep learning project is the degree to which it utilizes the training resources that it was allocated. Whether you are training in the cloud or on your own private infrastructure, training resources cost money, and any block of time in which they are left idle represents a potential opportunity to increase training throughput and overall productivity. This is particularly true for the training accelerator -- typically the most expensive training resource -- whether it be a GPU, a Google TPU, or a Habana Gaudi. This blog is a sequel to a previous post on the topic of Overcoming Data Preprocessing Bottlenecks in which we addressed the undesired scenario in which your training accelerator, henceforth assumed to be a GPU, finds itself idle while it waits for data input from an overly tasked CPU. The post covered several different ways of addressing this type of bottleneck and demonstrated them on a toy example, all the while emphasizing that the best option would very much depend on the specifics of the model and project at hand.
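The "degree to which training resources are utilized" can be made concrete by measuring how long each step waits for input versus how long it computes. The sketch below is a minimal illustration with synthetic timings, not the post's instrumentation.

```python
# Quantifying a data stall: accelerator utilization from per-step timings.

def accelerator_utilization(fetch_times, compute_times):
    """Fraction of wall-clock time the accelerator spends computing.

    fetch_times[i]   - seconds step i waited for preprocessed input
    compute_times[i] - seconds step i spent in forward/backward passes
    """
    waited = sum(fetch_times)
    computed = sum(compute_times)
    return computed / (waited + computed)

# A CPU-bound input pipeline: the GPU idles half of every step.
util = accelerator_utilization(fetch_times=[0.2] * 10,
                               compute_times=[0.2] * 10)
```

A utilization well below 1.0 is the signal that preprocessing, not the model, is the bottleneck, and that offloading it (for example over gRPC, as the post proposes) could pay for itself.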


The Seattle Report on Database Research

Communications of the ACM

From the inception of the field, academic database research has strongly influenced the database industry and vice versa. The database community, both research and industry, has grown substantially over the years. The relational database market alone has revenue upwards of $50B. On the academic front, database researchers continue to be recognized with significant awards. Over the last decade, our research community pioneered the use of columnar storage, which is used in all commercial data analytic platforms. Database systems offered as cloud services have witnessed explosive growth. Hybrid transactional/analytical processing (HTAP) systems are now an important segment of the industry. Furthermore, memory-optimized data structures, modern compilation, and code-generation have significantly enhanced performance of traditional database engines. All data platforms have embraced SQL-style APIs as the predominant way to query and retrieve data. Database researchers have played an important part in influencing the evolution of streaming data platforms as well as distributed key-value stores. A new generation of data cleaning and data wrangling technology is being actively explored.


Embrace Complexity (Part 4)

#artificialintelligence

Data, Cloud and AI are three fundamental forces that are fuelling an exciting (and at times scary) technological phase transition, in which we all move from the smog of the industrial age into the neon glow of the information age. If you have been following this series, you will know that we have been using the network shape as a north star guiding our search for a unified theory that connects these three forces together: a unified theory that can be used by all organisations, so that fewer of them are left behind. The graph-shaped data introduced in part 2 provides us with networked information, and networked information captures a richness and complexity that can only be found in the relationships that connect the parts together: the nuance and subtlety that exists in the lines connecting the dots.
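A tiny example makes "the information lives in the lines connecting the dots" concrete: with graph-shaped data, multi-hop questions are just edge traversals. The entities and relationships below are hypothetical.

```python
# Minimal graph-shaped data: adjacency by relationship type.

graph = {
    "customer_42": {"purchased": ["product_7"], "referred": ["customer_99"]},
    "customer_99": {"purchased": ["product_7"]},
    "product_7": {},
}

def neighbors(node, relation):
    """Follow one kind of relationship out of a node."""
    return graph.get(node, {}).get(relation, [])

# A question a flat table answers poorly: who did customer_42 refer,
# and what did those referrals buy?
referrals = neighbors("customer_42", "referred")
bought = {p for r in referrals for p in neighbors(r, "purchased")}
```

The answer falls out of two hops over relationships; in a flat, row-shaped representation the same question needs joins that reconstruct exactly these edges.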


The Data Engineering Pipeline

#artificialintelligence

Originally published on Towards AI. Data Engineers are at the heart of the engine room of any data-driven company.