Spatial Reasoning
Data Science: Feature Engineering with Spatial Flavour – Sp.4ML
Imagine that we work for a real estate agency and our job is to estimate apartment rental prices in different parts of New York City. In the classic machine learning approach, we work through the available variables and build a model that predicts the price. We will take this example and learn how to retrieve spatial information using the GeoPandas package and publicly available geographical datasets. The data for this article is shared on Kaggle and is also available in the blog post repository.
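To make the idea of a spatial feature concrete, here is a minimal sketch that adds a "distance to a reference point" column to a handful of listings. It is a simplified stand-in written with the standard library only, not the article's GeoPandas code. The listing records, the field names `latitude` and `longitude`, and the reference coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_distance_feature(listings, ref_lat, ref_lon, name="dist_to_ref_km"):
    """Return copies of the listing dicts with a distance-to-reference feature added."""
    return [
        {**row, name: haversine_km(row["latitude"], row["longitude"], ref_lat, ref_lon)}
        for row in listings
    ]


# Hypothetical listings (illustrative values, not taken from the Kaggle dataset).
listings = [
    {"id": 1, "latitude": 40.7580, "longitude": -73.9855, "price": 250},
    {"id": 2, "latitude": 40.6782, "longitude": -73.9442, "price": 120},
]
# Hypothetical reference point in Midtown Manhattan.
enriched = add_distance_feature(listings, 40.7580, -73.9855)
```

The new column can then be fed to a price model alongside the non-spatial variables. With a geospatial library, the same pattern extends to distances to parks, subway stations, or borough boundaries.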
Biologists revel in pinpointing active genes in tissue samples
As with real estate, location matters greatly for cells. Douglas Strand confirmed that truth last year when he used a new technique to map gene activity in bladder cancers. Until recently, scientists wanting to know all the genes at work in a tissue could analyze single cells without knowing their position, or they could measure average activity levels of genes across thousands of cells. Now, an emerging technology called spatial transcriptomics combines precision and breadth, mapping the work of thousands of genes in individual cells at pinpoint locations in tissue. That, Strand says, has been a “total game changer” for his research. The virtual Advances in Genome Biology and Technology (AGBT) meeting this month was a big coming-out party for the technique, which is revealing whole new landscapes of gene expression. Strand, for example, reported finding that cells surrounding bladder tumors, though outwardly normal, display many of the same gene activity changes as the cancer. “They looked more like tumor than normal tissue,” says Strand, who works at the University of Texas Southwestern Medical Center. He found surprises within the tumors, too: hidden patterns of gene activity suggesting some of the cells are more likely than others to spread beyond the bladder. Other biologists at the meeting reported using the technique to study Alzheimer's disease, track the dynamics of different types of T cells, and study lung, heart, and other tissues in COVID-19 patients. “The field is developing very, very fast,” says Aparna Bhaduri of the University of California, Los Angeles, who uses it to examine developing human brains. Scientists studying cells have long been able to examine the activity of a few, select genes in intact tissue—for example, by engineering a gene to tack on a fluorescent tag to the protein it encodes. 
By 2010, traditional transcriptomics, which examines cellular activity of many, if not all, known genes by probing for the messenger RNA (mRNA) transcripts they encode, took off. But those studies require tissues to be ground up first, so the data represent the average activity of genes in millions of cells. More recently, biologists have begun to monitor all the genes of single cells, uncovering vast differences in gene activity between different cell types and variation even within types. But because those cells are extracted from tissue with enzymes or teased out with lasers, microscopic tweezers, or other methods, the influence of their precise location and neighbor cells is lost. “We could see the individual parts, but we didn't know how the parts fit together,” explains Joseph Beechem, a biophysicist at NanoString Technologies, a leading company for spatial transcriptomics and related methods. Then in 2016, Swedish researchers described in Science how they managed to keep track of cells' locations while assessing the activity of about 200 of their genes (Science, 1 July 2016, p. [78][1]). The group put thin slices of a tissue onto slides precoated with short, known sequences of DNA, meant to act like identifiable barcodes, attached to other DNA designed to latch nonspecifically onto any mRNA nearby. The team treated the tissue with detergent to make cells leak their mRNA, which linked to the anchored, barcoded DNA, marking which cell the mRNA came from. Then, they added enzymes and DNA bases to the slice to translate each mRNA into a complementary DNA strand. Sequencing that strand along with its position-identifying barcode revealed the active parent gene and its position. Those data enabled computer programs to reconstruct the tissue locations of all the active genes.
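The final reconstruction step, in which each sequenced read carries a position-identifying barcode, can be pictured as a simple lookup. The sketch below is only an illustration of that idea, not any published pipeline. The barcodes, gene names, and the function `map_reads_to_positions` are hypothetical.

```python
from collections import Counter, defaultdict


def map_reads_to_positions(barcode_positions, reads):
    """Count transcripts per gene at each slide position.

    barcode_positions: dict mapping a barcode string to an (x, y) spot on the slide.
    reads: iterable of (barcode, gene) pairs obtained from sequencing.
    Reads whose barcode is not on the slide are skipped.
    """
    counts = defaultdict(Counter)
    for barcode, gene in reads:
        position = barcode_positions.get(barcode)
        if position is None:
            continue  # unknown barcode, cannot be placed in the tissue
        counts[position][gene] += 1
    return {position: dict(genes) for position, genes in counts.items()}


# Hypothetical slide layout and reads.
barcode_positions = {"AAAC": (0, 0), "GGTT": (0, 1)}
reads = [("AAAC", "GENE_A"), ("AAAC", "GENE_A"), ("GGTT", "GENE_B"), ("TTTT", "GENE_A")]
expression_map = map_reads_to_positions(barcode_positions, reads)
```

Each key of `expression_map` is a location in the tissue slice and each value is the gene activity measured there, which is the kind of table a downstream program would turn into a spatial map.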
Multiple companies have begun to sell expensive machines that conduct such spatial transcriptomics analyses, making it possible to study thousands of genes in hundreds of cells in their proper places. That “can tell you a lot about how cell communication might break down in disease,” says Aviv Regev, a computation and systems biologist who heads the Genentech Research and Early Development unit of Roche. Christopher Mason, a geneticist at Weill Cornell Medicine (WCM), and colleagues have performed spatial transcriptomics on fresh or preserved tissue samples from autopsied COVID-19 patients, comparing them with lung tissue of healthy adults and people who died of other acute respiratory infections or flu. The commercial devices they used, one based on the Swedish approach, can assess lots of genes but can't completely pinpoint their activity to single cells. (Other methods are limited to far fewer genes, but specify locations better.) The team, including WCM's Robert Schwartz and Alain Borczuk and others, mapped the activity of the gene for angiotensin-converting enzyme 2, the cell-surface receptor targeted by SARS-CoV-2, and other identifying immune cells called macrophages and neutrophils. In normal lung tissue, macrophages make up less than 4% of the cells; in COVID-19 lungs, they sometimes topped 50%, Mason reported at the AGBT meeting. The lung itself changes as well, he and his colleagues discovered by looking at gene activity in these lung samples. Late in the disease, the organ's normal cellular architecture was disrupted, and cells adjacent to blood vessels had changed. The WCM group and others have done spatial analyses of gene activity for other parts of the COVID-19–ravaged body. The coronavirus seems to turn off genes in nasal cells that sense smells and causes a reorganization of the cells in the lining of the nose; those changes may contribute to the loss of smell and taste infected people often experience. 
The hearts of COVID-19 patients also betrayed an impact. Under the microscope they appear to have a normal number of muscle cells, Mason says, “but if you look at gene expression, it seems the cells have forgotten what they are supposed to be doing.” Stanford University neuroscientist Andrew Yang has done a similar gene activity comparison of preserved human brain tissue to understand why some people with protein deposits called amyloid plaques don't develop Alzheimer's disease and others do. In tissue from Alzheimer's patients, nonneuronal cells close to these plaques show increased activity of genes whose proteins mark nerve cell connections called synapses for destruction. Other revved up genes suggest increased action by scavenger cells called microglia, which prune synapses and cause potentially harmful inflammation. “We're beginning to understand what makes for a good or bad response to these aggregates,” Yang says. These early results only begin to address the potential of spatial transcriptomics. The current methods don't yet work robustly in all types of tissues, and analyses can take days to complete. Companies continue to upgrade their instruments, but so far, none can really quantify all the active genes in a tissue at the single-cell level. At about $300,000 each, some of the machines are also prohibitively expensive for many labs. The Broad Institute has come up with a cheaper DIY version. Called “Slide-seq,” the technique uses a layer of tiny beads, coated with pieces of barcoded DNA, on a slide to help mark the positions of mRNA from thousands of genes (Science, 29 March 2019, p. [1463][2]). At the AGBT meeting, Broad genomicist Robert Stickels described version 2.0, which crams much more DNA onto each bead and can put up to 1 million beads on a slide, making the gene-activity mapping more precise by an order of magnitude. The entire protocol is public, Stickels says.
“It really empowers other labs to do it.” For example, Abhishek Sampath Kumar, a graduate student at the Max Planck Institute for Molecular Genetics, now gets slides from Broad. “This technique is easy to apply,” says Kumar, who is studying mammalian heart development. “You don't need any special instruments compared to other methods.” Both industrial and academic labs are racing to improve spatial transcriptomics and to extend cell-by-cell mapping to other key indicators. “Soon there will be technologies that give you more and more types of data all together at the same time, spatial information, RNA, DNA, chromatin, protein, temporal information about cellular histories, metabolite profiling, you name it, at single-cell resolution,” Stickels predicts. Many biologists are thrilled at the prospects. “I think we will be rewriting the textbook on how organisms develop, and we are going to understand how the body responds to drugs in a way that nobody has been able to do before,” Beechem says. “Spatial biology is providing the next revolution in biology.”
[1]: http://www.sciencemag.org/content/353/6294/78
[2]: http://www.sciencemag.org/content/363/6434/1463
Hippocampal formation-inspired probabilistic generative model
Taniguchi, Akira, Fukawa, Ayako, Yamakawa, Hiroshi
We constructed a hippocampal formation (HPF)-inspired probabilistic generative model (HPF-PGM) using the structure-constrained interface decomposition method. By modeling brain regions with PGMs, this model is positioned as a module that can be integrated as a whole-brain PGM. We discuss the relationship between simultaneous localization and mapping (SLAM) in robotics and the findings of HPF in neuroscience. Furthermore, we survey the modeling for HPF and various computational models, including brain-inspired SLAM, spatial concept formation, and deep generative models. The HPF-PGM is a computational model that is highly consistent with the anatomical structure and functions of the HPF, in contrast to typical conventional SLAM models. By referencing the brain, we suggest the importance of the integration of egocentric/allocentric information from the entorhinal cortex to the hippocampus and the use of discrete-event queues.
The RLR-Tree: A Reinforcement Learning Based R-Tree for Spatial Data
Gu, Tu, Feng, Kaiyu, Cong, Gao, Long, Cheng, Wang, Zheng, Wang, Sheng
Learned indices have been proposed to replace classic index structures like B-Tree with machine learning (ML) models. They require to replace both the indices and query processing algorithms currently deployed by the databases, and such a radical departure is likely to encounter challenges and obstacles. In contrast, we propose a fundamentally different way of using ML techniques to improve on the query performance of the classic R-Tree without the need of changing its structure or query processing algorithms.
Despite the success of these learned indices in improving the performance of some types of queries, they still have various limitations, e.g., they can only handle spatial point objects and limited types of spatial queries, some only return approximate query results, and they either cannot handle updates or need a periodic rebuild to retain high query efficiency (Detailed discussions are in Section 2). These limitations, together with the requirement that the learned indices need a replacement of the index structures and
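For readers unfamiliar with the classic R-Tree, the sketch below shows the kind of hand-written heuristic its insertion procedure relies on: choosing the child node whose bounding rectangle needs the least area enlargement to cover a new object. The excerpt does not say which decisions the learned model takes over, so treating this step as a natural place for a learned policy is an assumption. The function names are illustrative only.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    """Area of an axis-aligned rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr: Rect, obj: Rect) -> float:
    """Extra area the bounding rectangle needs in order to cover obj."""
    return area(union(mbr, obj)) - area(mbr)


def choose_subtree(child_mbrs: List[Rect], obj: Rect) -> int:
    """Index of the child with least enlargement, ties broken by smaller area."""
    return min(
        range(len(child_mbrs)),
        key=lambda i: (enlargement(child_mbrs[i], obj), area(child_mbrs[i])),
    )


# Hypothetical node with two children and two objects to insert.
children = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 9.0, 9.0)]
first_choice = choose_subtree(children, (1.0, 1.0, 3.0, 3.0))
second_choice = choose_subtree(children, (6.0, 6.0, 7.0, 7.0))
```

Because the tree layout and the query algorithm are untouched, any policy that returns a child index here could be swapped in without changing how queries are answered.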
Interview with Konstantin Klemmer – talking Climate Change AI and geographic data research
Konstantin Klemmer is a PhD student at the University of Warwick working at the intersection of machine learning and geographic data. He also serves as the Communications Chair for Climate Change AI. We talked about his research and the Climate Change AI organisation. Climate Change AI (CCAI) is a volunteer run organisation that catalyses impactful work at the intersection of climate change and machine learning by providing education and infrastructure, building a community, and advancing discourse. We also run a forum and regular community events like our fortnightly happy hour.
Learning Large-scale Location Embedding From Human Mobility Trajectories with Graphs
Tian, Chenyu, Zhang, Yuchun, Weng, Zefeng
GPS coordinates and other fine-grained location indicators are difficult for machine learning models to use effectively in geo-aware applications. Previous location embedding methods are mostly tailored to specific problems that take place within areas of interest. At the scale of entire cities, existing approaches suffer from extensive computational cost and significant information loss. An increasing amount of location-based service (LBS) data is being accumulated and released to the public, enabling us to study urban dynamics and human mobility. This study learns vector representations for locations using large-scale LBS data. Unlike existing studies, we propose to consider both spatial connection and human mobility, and jointly learn the representations from a flow graph and a spatial graph through a GCN-aided skip-gram model named GCN-L2V. This model embeds context information in human mobility and spatial information. By doing so, GCN-L2V is able to capture relationships among locations and provide a better notion of semantic similarity in a spatial environment. Across quantitative experiments and case studies, we empirically demonstrate that the representations learned by GCN-L2V are effective. GCN-L2V can be applied in a complementary manner to other place embedding methods and downstream geo-aware applications.
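As a rough illustration of the two ingredients named in the abstract, the sketch below builds a flow graph from trajectories and extracts skip-gram (center, context) pairs from them. It does not implement GCN-L2V itself: the GCN component and the training loop are omitted. The trajectory format, the window size, and the function names are assumptions.

```python
from collections import Counter


def flow_graph(trajectories):
    """Count transitions between consecutive locations across all trajectories."""
    edges = Counter()
    for trajectory in trajectories:
        for src, dst in zip(trajectory, trajectory[1:]):
            edges[(src, dst)] += 1
    return edges


def skipgram_pairs(trajectory, window=2):
    """(center, context) pairs for every location within `window` steps of the center."""
    pairs = []
    for i, center in enumerate(trajectory):
        start = max(0, i - window)
        stop = min(len(trajectory), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, trajectory[j]))
    return pairs


# Hypothetical trajectories over location identifiers.
trajectories = [["home", "cafe", "office"], ["home", "cafe", "gym"]]
edges = flow_graph(trajectories)
pairs = skipgram_pairs(trajectories[0], window=1)
```

Here `edges` gives weighted edges of a mobility flow graph, and `pairs` are the training examples a skip-gram objective would consume. Locations that appear in similar contexts end up with similar vectors.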
NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting
Chen, Kai, Chen, Guang, Xu, Dan, Zhang, Lijun, Huang, Yuyao, Knoll, Alois
Although the Transformer has achieved breakthrough success in many domains, especially Natural Language Processing (NLP), applying it to time series forecasting remains a great challenge. In time series forecasting, the autoregressive decoding of canonical Transformer models can introduce large accumulated errors. Moreover, using the Transformer to handle spatial-temporal dependencies in this problem remains difficult. To tackle these limitations, this work is the first attempt to propose a Non-Autoregressive Transformer architecture for time series forecasting, aiming to overcome the time delay and accumulated error issues of the canonical Transformer. We also present a novel spatial-temporal attention mechanism that uses a learned temporal influence map to bridge the gap between spatial and temporal attention, so that spatial and temporal dependencies can be processed jointly. Empirically, we evaluate our model on diverse ego-centric future localization datasets and demonstrate state-of-the-art performance in both real-time capability and accuracy.
Inferring spatial relations from textual descriptions of images
Elu, Aitzol, Azkune, Gorka, de Lacalle, Oier Lopez, Arganda-Carreras, Ignacio, Soroa, Aitor, Agirre, Eneko
Generating an image from its textual description requires both a certain level of language understanding and common sense knowledge about the spatial relations of the physical entities being described. In this work, we focus on inferring the spatial relation between entities, a key step in the process of composing scenes based on text. More specifically, given a caption containing a mention of a subject and the location and size of the bounding box of that subject, our goal is to predict the location and size of an object mentioned in the caption. Previous work did not use the caption text, but a manually provided relation holding between the subject and the object. In fact, the evaluation datasets used contain manually annotated ontological triplets but no captions, making the exercise unrealistic: a manual step was required, and systems did not leverage the richer information in captions. Here we present a system that uses the full caption, and Relations in Captions (REC-COCO), a dataset derived from MS-COCO which allows spatial relation inference to be evaluated directly from captions. Our experiments show that: (1) it is possible to infer the size and location of an object with respect to a given subject directly from the caption; (2) using the full text places the object better than using a manually annotated relation. Our work paves the way for systems that, given a caption, decide which entities need to be depicted and their respective locations and sizes, in order to then generate the final image.
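One way to make "location and size of an object with respect to a given subject" concrete is to express the object's box in units of the subject's box. The sketch below shows such an encoding and its inverse. This parameterization is an assumption chosen for illustration, not necessarily the one used by the authors. The `Box` type and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


def encode_relative(subject: Box, obj: Box) -> Tuple[float, float, float, float]:
    """Object offset and size measured in units of the subject's width and height."""
    return (
        (obj.cx - subject.cx) / subject.w,
        (obj.cy - subject.cy) / subject.h,
        obj.w / subject.w,
        obj.h / subject.h,
    )


def decode_relative(subject: Box, rel: Tuple[float, float, float, float]) -> Box:
    """Invert encode_relative: recover the object's absolute box."""
    dx, dy, rw, rh = rel
    return Box(
        subject.cx + dx * subject.w,
        subject.cy + dy * subject.h,
        rw * subject.w,
        rh * subject.h,
    )


# Hypothetical boxes in pixel coordinates.
subject_box = Box(100.0, 100.0, 50.0, 40.0)
object_box = Box(150.0, 80.0, 25.0, 20.0)
relative = encode_relative(subject_box, object_box)
recovered = decode_relative(subject_box, relative)
```

A model that reads the caption would predict the four relative numbers. Decoding them against the known subject box yields the predicted object box, independent of image scale.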
Solar Radiation Anomaly Events Modeling Using Spatial-Temporal Mutually Interactive Processes
Zhang, Minghe, Xu, Chen, Sun, Andy, Qiu, Feng, Xie, Yao
Solar power installations are becoming common in residential and commercial areas, largely due to their decreasing costs. However, the power system is vulnerable to anomalies such as rainstorms or hurricanes, which are costly to recover from. As a result, detecting and predicting abnormal events from the spatial-temporal series plays a vital role in the solar system, aiming to capture the variety of intrinsic reasons for the anomalies. For example, rainstorms and droughts produce different types and patterns of anomalies. In many cases, an abnormal event will also start at one location and then propagate to its neighbors with a time delay, leading to spatial-temporal correlation among anomalies. Thus it is crucial to make observations at multiple locations, which together form the spatial-temporal series. In this paper, we address non-stationarity and strong spatial-temporal correlation through the following contributions:
- Strong spatial-temporal correlation: We present a spatial-temporal Bernoulli process (also extended to categorical observations), which was proposed in [19]. The model can flexibly capture the spatial-temporal correlations and interactions without assuming time-decaying influence. It can also efficiently make predictions for any location at any future time for timely ramp event detection.
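To illustrate how a Bernoulli process can let events at one location raise the chance of later events elsewhere, here is a minimal simulation sketch. The linear form (a baseline plus lagged influence weights, clipped to [0, 1]) is an assumption made for illustration, since the excerpt names the process without giving its formula. All parameter values and function names are hypothetical.

```python
import random


def event_probability(baseline, influence, history, k):
    """Probability of an event at location k in the next time step.

    baseline[k]: background event probability at location k.
    influence[s][k][l]: effect on location k of an event at location l, s + 1 steps ago.
    history: list of past 0/1 observation vectors, most recent last.
    """
    p = baseline[k]
    for s, weights in enumerate(influence):
        if s >= len(history):
            break  # not enough history for this lag yet
        past = history[-1 - s]
        p += sum(w * y for w, y in zip(weights[k], past))
    return min(1.0, max(0.0, p))  # clip so the result is a valid probability


def simulate(baseline, influence, steps, seed=0):
    """Draw a 0/1 observation vector per time step from the process."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        obs = [
            1 if rng.random() < event_probability(baseline, influence, history, k) else 0
            for k in range(len(baseline))
        ]
        history.append(obs)
    return history


# Two locations: events at location 0 trigger location 1 two steps later.
# The lag-2 weight exceeds the lag-1 weight, so influence need not decay with time.
baseline = [0.1, 0.0]
influence = [
    [[0.0, 0.0], [0.0, 0.0]],  # lag 1
    [[0.0, 0.0], [0.8, 0.0]],  # lag 2
]
series = simulate(baseline, influence, steps=50)
```

Because each lag has its own weight matrix, the sketch does not force influence to fade over time, matching the flexibility the abstract emphasizes.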
Modeling massive multivariate spatial data with the basis graphical lasso
Krock, Mitchell, Kleiber, William, Hammerling, Dorit, Becker, Stephen
We propose a new modeling framework for highly multivariate spatial processes that synthesizes ideas from recent multiscale and spectral approaches with graphical models. The basis graphical lasso writes a univariate Gaussian process as a linear combination of basis functions weighted with entries of a Gaussian graphical vector whose graph is estimated from optimizing an $\ell_1$ penalized likelihood. This paper extends the setting to a multivariate Gaussian process where the basis functions are weighted with Gaussian graphical vectors. We motivate a model where the basis functions represent different levels of resolution and the graphical vectors for each level are assumed to be independent. Using an orthogonal basis grants linear complexity and memory usage in the number of spatial locations, the number of basis functions, and the number of realizations. An additional fusion penalty encourages a parsimonious conditional independence structure in the multilevel graphical model. We illustrate our method on a large climate ensemble from the National Center for Atmospheric Research's Community Atmosphere Model that involves 40 spatial processes.
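The univariate construction described in the abstract can be written out as follows. This is a sketch under stated assumptions: the independent noise term $\varepsilon$ with variance $\tau^2$, the symbols $\Phi$, $Q$, $S$, $\lambda$, and the choice to penalize only off-diagonal entries are notation introduced here, not taken from the abstract.

```latex
Y(s) = \sum_{j=1}^{\ell} c_j \, \phi_j(s) + \varepsilon(s), \qquad
\mathbf{c} = (c_1, \dots, c_\ell)^{\top} \sim \mathcal{N}\!\left(\mathbf{0},\, Q^{-1}\right), \qquad
\varepsilon(s) \overset{\text{iid}}{\sim} \mathcal{N}\!\left(0,\, \tau^2\right)
```

With $\Phi$ the $n \times \ell$ matrix of basis functions evaluated at the $n$ locations and $S$ the empirical covariance of the observed realizations, the implied covariance is $\Phi Q^{-1} \Phi^{\top} + \tau^2 I_n$, and an $\ell_1$ penalized Gaussian likelihood estimate of the precision matrix takes the form

```latex
\hat{Q} \in \arg\min_{Q \succ 0} \;
\log\det\!\left(\Phi Q^{-1} \Phi^{\top} + \tau^2 I_n\right)
+ \operatorname{tr}\!\left( S \left(\Phi Q^{-1} \Phi^{\top} + \tau^2 I_n\right)^{-1} \right)
+ \lambda \sum_{i \neq j} \lvert Q_{ij} \rvert
```

Zeros in $\hat{Q}$ correspond to missing edges in the graph over basis coefficients, which is where the graphical-model interpretation comes from.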