Information Fusion
Migrating from AWS Glue to BigQuery for ETL
Our journey with AWS Glue became a struggle once we dug deeper into its streaming functionality. Orchestrating so many layers added a huge overhead that we weren't expecting, and whilst most of that is handled within the AWS suite of products, there are just too many benefits to switching our pipelines over to GCP and BigQuery to be ignored. Next steps are to finalise our deployment by using Cloud Composer (Airflow) to orchestrate the creation of each of the tables and provide a monitoring dashboard to help us detect failures and act on them. I will say that AWS got in touch with me after my previous article and I got on a call with the AWS Glue product team. In their words, I had "hit pretty much every sharp edge possible" (seems to be a running theme with me -- perhaps I should switch careers to QA engineer?).
Detecting Safety Problems of Multi-Sensor Fusion in Autonomous Driving
Zhong, Ziyuan, Hu, Zhisheng, Guo, Shengjian, Zhang, Xinyang, Zhong, Zhenyu, Ray, Baishakhi
Autonomous driving (AD) systems have been thriving in recent years. In general, they receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor inputs, AD systems usually leverage multi-sensor fusion (MSF) to fuse the sensor inputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular MSF methods in an industry-grade Advanced Driver-Assistance System (ADAS) can mislead the car control and result in serious safety hazards. Misbehavior can happen regardless of the fusion method used, even when at least one sensor provides accurate data. To attribute the safety hazards to an MSF method, we formally define the fusion errors and propose a way to distinguish safety violations causally induced by such errors. Further, we develop a novel evolutionary-based domain-specific search framework, FusionFuzz, for the efficient detection of fusion errors. We evaluate our framework on two widely used MSF methods. Experimental results show that FusionFuzz identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods under study.
On Event-Driven Knowledge Graph Completion in Digital Factories
Ringsquandl, Martin, Kharlamov, Evgeny, Stepanova, Daria, Lamparter, Steffen, Lepratti, Raffaello, Horrocks, Ian, Kröger, Peer
Smart factories are equipped with machines that can sense their manufacturing environments, interact with each other, and control production processes. Smooth operation of such factories requires that the machines and engineering personnel that conduct their monitoring and diagnostics share a detailed common industrial knowledge about the factory, e.g., in the form of knowledge graphs. Creation and maintenance of such knowledge is expensive and requires automation. In this work we show how machine learning that is specifically tailored towards industrial applications can help in knowledge graph completion. In particular, we show how knowledge completion can benefit from event logs that are common in smart factories. We evaluate this on the knowledge graph from a real world-inspired smart factory with encouraging results.
How low-code platforms enable machine learning
Low-code platforms improve the speed and quality of developing applications, integrations, and data visualizations. Instead of building forms and workflows in code, low-code platforms provide drag-and-drop interfaces to design screens, workflows, and data visualizations used in web and mobile applications. Low-code integration tools support data integrations, data prep, API orchestrations, and connections to common SaaS platforms. If you're designing dashboards and reports, there are many low-code options to connect to data sources and create data visualizations. If you can do it in code, there's probably a low-code or no-code technology that can help accelerate the development process and simplify ongoing maintenance.
Multi-Agent Variational Occlusion Inference Using People as Sensors
Itkina, Masha, Mun, Ye-Ji, Driggs-Campbell, Katherine, Kochenderfer, Mykel J.
Autonomous vehicles must reason about spatial occlusions in urban environments to ensure safety without being overly cautious. Prior work explored occlusion inference from observed social behaviors of road agents. Inferring occupancy from agent behaviors is an inherently multimodal problem; a driver may behave in the same manner for different occupancy patterns ahead of them (e.g., a driver may move at constant speed in traffic or on an open road). Past work, however, does not account for this multimodality, thus neglecting to model this source of aleatoric uncertainty in the relationship between driver behaviors and their environment. We propose an occlusion inference method that characterizes observed behaviors of human agents as sensor measurements, and fuses them with those from a standard sensor suite. To capture the aleatoric uncertainty, we train a conditional variational autoencoder with a discrete latent space to learn a multimodal mapping from observed driver trajectories to an occupancy grid representation of the view ahead of the driver. Our method handles multi-agent scenarios, combining measurements from multiple observed drivers using evidential theory to solve the sensor fusion problem. Our approach is validated on a real-world dataset, outperforming baselines and demonstrating real-time capable performance. Our code is available at https://github.com/sisl/MultiAgentVariationalOcclusionInference .
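The abstract above states that measurements from multiple observed drivers are combined using evidential theory. As a rough illustration only, the sketch below applies Dempster's rule of combination, one standard rule from evidential (Dempster-Shafer) theory, to a single occupancy grid cell. Whether the authors use this exact rule is an assumption, and the names `combine`, `driver_a`, `driver_b` and all mass values are hypothetical.

```python
from itertools import product

# Frame of discernment for one grid cell: occupied ("O") or free ("F").
OCC = frozenset({"O"})
FREE = frozenset({"F"})
UNKNOWN = frozenset({"O", "F"})  # mass on the whole frame expresses ignorance


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a subset of the frame to a belief mass,
    and the masses of one function sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the common subset.
            fused[intersection] = fused.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Contradictory evidence (occupied vs. free) counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    # Renormalise so the fused masses sum to 1 again.
    return {subset: mass / (1.0 - conflict) for subset, mass in fused.items()}


# Hypothetical per-cell estimates inferred from two observed drivers.
driver_a = {OCC: 0.6, FREE: 0.1, UNKNOWN: 0.3}
driver_b = {OCC: 0.5, FREE: 0.2, UNKNOWN: 0.3}

fused = combine(driver_a, driver_b)
for subset, mass in fused.items():
    print(sorted(subset), round(mass, 3))
```

In this toy case both sources lean towards "occupied", so the fused mass on "occupied" ends up higher than either input while the mass left on ignorance shrinks. In a full grid the same rule would be applied cell by cell.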
Bayesian data combination model with Gaussian process latent variable model for mixed observed variables under NMAR missingness
Mitsuhiro, Masaki, Hoshino, Takahiro
In the analysis of observational data in social sciences and businesses, it is difficult to obtain a "(quasi) single-source dataset" in which the variables of interest are simultaneously observed. Instead, multiple-source datasets are typically acquired for different individuals or units. Various methods have been proposed to investigate the relationship between the variables in each dataset, e.g., matching and latent variable modeling. It is necessary to utilize these datasets as a single-source dataset with missing variables. Existing methods assume that the datasets to be integrated are acquired from the same population or that the sampling depends on covariates. This assumption is referred to as missing at random (MAR) in terms of missingness. However, as will be shown in application studies, it is likely that this assumption does not hold in actual data analysis and the results obtained may be biased. We propose a data fusion method that does not assume that datasets are homogeneous. We use a Gaussian process latent variable model for non-MAR missing data. This model assumes that the variables of concern and the probability of being missing depend on latent variables. A simulation study and real-world data analysis show that the proposed method with a missing-data mechanism and the latent Gaussian process yields valid estimates, whereas an existing method provides severely biased estimates. This is the first study in which non-random assignment to datasets is considered and resolved under reasonable assumptions in the data fusion problem.
Rapidly and accurately estimating brain strain and strain rate across head impact types with transfer learning and data fusion
Zhan, Xianghao, Liu, Yuzhe, Cecchi, Nicholas J., Gevaert, Olivier, Zeineh, Michael M., Grant, Gerald A., Camarillo, David B.
Brain strain and strain rate are effective in predicting traumatic brain injury (TBI) caused by head impacts. However, state-of-the-art finite element modeling (FEM) demands considerable computational time, limiting its application in real-time TBI risk monitoring. To accelerate the computation, machine learning head models (MLHMs) were developed, and the model accuracy was found to decrease when the training/test datasets were from different head impact types. However, the size of the dataset for specific impact types may not be enough for model training. To address the computational cost of FEM, the limited strain rate prediction, and the generalizability of MLHMs to on-field datasets, we propose data fusion and transfer learning to develop a series of MLHMs to predict the maximum principal strain (MPS) and maximum principal strain rate (MPSR). We trained and tested the MLHMs on 13,623 head impacts from simulations, American football, mixed martial arts, and car crashes, and compared them against the models trained on only simulations or only on-field impacts. The MLHMs developed with transfer learning are significantly more accurate in estimating MPS and MPSR than other models, with a mean absolute error (MAE) smaller than 0.03 in predicting MPS and smaller than 7 (1/s) in predicting MPSR on all impact datasets. The MLHMs can be applied to various head impact types for rapidly and accurately calculating brain strain and strain rate. Besides the clinical applications in real-time brain strain and strain rate monitoring, this model helps researchers estimate the brain strain and strain rate caused by head impacts more efficiently than FEM.
ETL Developer
Our Data & Analytics team is ready for you to join iOLAP in the role of an ETL Developer! You will be joining a team of equally passionate and skilled data engineers, architects, designers, DevOps engineers who are working with the latest technologies on exciting projects. Building upon our 20-year strong global experience and deep expertise across multiple industry verticals, we are focused on creating solutions that bring efficiency, security, and scale to our clients. In this role, you will be responsible for collecting, transforming, and sending data through the chain in the proper format up to the warehouse level. You will help to build efficient and stable data pipelines which can be easily maintained in the future.
Data Engineer (10+)
We are looking for a colleague passionate about building data platforms, business insights, storytelling, narrative, heavy data lifting, analytics, and, generally, helping data-driven products become alive. Data engineering tasks will range from working on third-party integrations, implementing ETL processes, designing data pipelines and data lakes, automating and orchestrating computations, and building data-intensive systems. If this sounds interesting to you and you do not like to be constrained by a single programming language or tool choice, then chances are we are a good fit for each other. This position is open for all of our development centers.
Detection of Illicit Drug Trafficking Events on Instagram: A Deep Multimodal Multilabel Learning Approach
Hu, Chuanbo, Yin, Minglei, Liu, Bin, Li, Xin, Ye, Yanfang
Social media such as Instagram and Twitter have become important platforms for marketing and selling illicit drugs. Detection of online illicit drug trafficking has become critical to combat the online trade of illicit drugs. However, the legal status often varies spatially and temporally; even for the same drug, federal and state legislation can have different regulations about its legality. Meanwhile, more drug trafficking events are disguised as a novel form of advertising in comments, leading to information heterogeneity. Accordingly, accurate detection of illicit drug trafficking events (IDTEs) from social media has become even more challenging. In this work, we conduct the first systematic study on fine-grained detection of IDTEs on Instagram. We propose to take a deep multimodal multilabel learning (DMML) approach to detect IDTEs and demonstrate its effectiveness on a newly constructed dataset called multimodal IDTE (MM-IDTE). Specifically, our model takes text and image data as the input and combines multimodal information to predict multiple labels of illicit drugs. Inspired by the success of BERT, we have developed a self-supervised multimodal bidirectional transformer by jointly fine-tuning pretrained text and image encoders. We have constructed a large-scale dataset MM-IDTE with manually annotated multiple drug labels to support fine-grained detection of illicit drugs. Extensive experimental results on the MM-IDTE dataset show that the proposed DMML methodology can accurately detect IDTEs even in the presence of special characters and style changes attempting to evade detection.