

How I Fell Back in Love with iPhone Photography

The New Yorker

There's a Japanese word, komorebi, that describes beams of light and dappled shadows that result when the sun shines through trees. When I take my dog on walks around my leafy neighborhood in Washington, D.C., komorebi is what most often catches my eye, especially in this autumnal moment when dense, green summer foliage is starting to thin and turn golden. As the sun sets and the shadows grow long on the edge of a precipitous valley near my apartment, the foliage creates fluttering patterns of warm and cool colors. I try to photograph these apparitions with my iPhone camera, but I'm always disappointed in the results: the device's automated image processing treats contrast as a problem to be solved, aggressively darkening the highlights and lightening up the shadows to achieve a bland flatness. Little of the lambent atmosphere I see in real life survives in the image.


Machine Learning guided high-throughput search of non-oxide garnets

Schmidt, Jonathan, Wang, Haichen, Schmidt, Georg, Marques, Miguel

arXiv.org Artificial Intelligence

Garnets, known since the early stages of human civilization, have found important applications in modern technologies including magnetostriction, spintronics, lithium batteries, etc. The overwhelming majority of experimentally known garnets are oxides, while explorations (experimental or theoretical) of the rest of the chemical space have been limited in scope. A key issue is that the garnet structure has a large primitive unit cell, requiring an enormous amount of computational resources. To perform a comprehensive search of the complete chemical space for new garnets, we combine recent progress in graph neural networks with high-throughput calculations. We apply the machine learning model to identify potential (meta-)stable garnet systems before performing systematic density-functional calculations to validate the predictions. In this way, we discover more than 600 ternary garnets with distances to the convex hull below 100 meV/atom and with a variety of physical and chemical properties. These include sulfide, nitride, and halide garnets. For these, we analyze the electronic structure and discuss the connection between the value of the electronic band gap and charge balance.
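The workflow the abstract describes — a fast surrogate model filters candidates by predicted distance to the convex hull, and only those below a stability threshold go on to expensive DFT validation — can be sketched as follows. This is a minimal illustration, not the authors' code: the candidate names and predicted energies below are placeholders, not real model output.

```python
# Sketch of the ML-guided screening step: keep only candidates whose
# predicted energy above the convex hull is below the threshold the
# paper uses (100 meV/atom); these would be passed to DFT validation.

STABILITY_THRESHOLD = 0.100  # eV/atom, i.e. 100 meV/atom as in the abstract

def screen_candidates(predicted_e_hull):
    """Filter candidate compositions by predicted distance to the hull."""
    return {
        name: e_hull
        for name, e_hull in predicted_e_hull.items()
        if e_hull < STABILITY_THRESHOLD
    }

# Placeholder predictions (eV/atom), for illustration only.
predictions = {
    "candidate_A": 0.250,  # predicted well above the hull: discard
    "candidate_B": 0.045,  # predicted near-stable: validate with DFT
    "candidate_C": 0.012,  # predicted near-stable: validate with DFT
}

survivors = screen_candidates(predictions)
print(sorted(survivors))
```

The point of the two-stage design is cost: the surrogate model evaluates a composition in milliseconds, so the expensive density-functional step only runs on the small fraction of candidates that pass the filter.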


Announcing Tensor Comprehensions

#artificialintelligence

Today, Facebook AI Research (FAIR) is announcing the release of Tensor Comprehensions, a C++ library and mathematical language that helps bridge the gap between researchers, who communicate in terms of mathematical operations, and engineers focusing on the practical needs of running large-scale models on various hardware backends. The main differentiating feature of Tensor Comprehensions is that it represents a unique take on Just-In-Time compilation to produce the high-performance code that the machine learning community needs, automatically and on demand. Over the last few years, the deep learning community has grown to rely on high-performance libraries such as cuBLAS, MKL, and cuDNN to get high-performance code on GPUs and CPUs. Experimenting with ideas that deviate from the primitives provided in these libraries involves a level and magnitude of engineering that can be intimidating to researchers. We anticipate great practical value in open-sourcing a package that shortens this process from days or weeks to minutes.
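To make the "mathematical operations" framing concrete: the Tensor Comprehensions language lets a researcher state an operation in index notation close to the math, e.g. a matrix multiply written roughly as `O(m, n) +=! A(m, k) * B(k, n)`, where `+=!` means "zero-initialize the output, then sum over the unbound index k". As a plain-NumPy analogue (not TC itself) of what that comprehension computes, `einsum` expresses the same index expression directly:

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)   # shape (M, K)
B = np.arange(12, dtype=np.float32).reshape(3, 4)  # shape (K, N)

# Sum over the shared index k, mirroring O(m, n) +=! A(m, k) * B(k, n).
O = np.einsum("mk,kn->mn", A, B)

assert np.allclose(O, A @ B)
print(O.shape)
```

The difference, of course, is that TC JIT-compiles such expressions into tuned kernels for a target backend, whereas `einsum` merely dispatches to a pre-written library routine — which is exactly the gap the announcement is about.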


Technical Perspective: Can High Performance be Portable?

Communications of the ACM

The development of high-performance software has always suffered from a tension between achieving high performance on the one hand and portability and simplicity on the other hand. By specializing an algorithm for optimal performance, considering the memory hierarchy and other architectural particulars, we introduce architecture-specific detail. This obscures algorithmic structure and conflates the general with the specific, compromising simplicity and clarity. It also hurts portability to all but very similar architectures--simple changes, such as different cache sizes, can have substantial performance implications. Moreover, distinctly different architectures, such as CPUs versus GPUs versus DSPs, often require fundamentally different optimization strategies.


Halide

Communications of the ACM

Writing high-performance code on modern machines requires not just locally optimizing inner loops, but globally reorganizing computations to exploit parallelism and locality--doing things such as tiling and blocking whole pipelines to fit in cache. This is especially true for image processing pipelines, where individual stages do much too little work to amortize the cost of loading and storing results to and from off-chip memory. As a result, the performance difference between a naive implementation of a pipeline and one globally optimized for parallelism and locality is often an order of magnitude. However, using existing programming tools, writing high-performance image processing code requires sacrificing simplicity, portability, and modularity. We argue that this is because traditional programming models conflate the computations defining the algorithm with decisions about intermediate storage and the order of computation, which we call the schedule. We propose a new programming language for image processing pipelines, called Halide, that separates the algorithm from its schedule. Programmers can change the schedule to express many possible organizations of a single algorithm. The Halide compiler then synthesizes a globally combined loop nest for an entire algorithm, given a schedule. Halide models a space of schedules which is expressive enough to describe organizations that match or outperform state-of-the-art hand-written implementations of many computational photography and computer vision algorithms. Its model is simple enough to do so often in only a few lines of code, and small changes generate efficient implementations for x86, ARM, Graphics Processors (GPUs), and specialized image processors, all from a single algorithm. 
Halide has been public and open source for over four years, during which it has been used by hundreds of programmers to deploy code to tens of thousands of servers and hundreds of millions of phones, processing billions of images every day. Computational photography and computer vision algorithms require highly efficient implementations to be used in practice, from power-constrained mobile devices to data centers processing billions of images. This is not a simple matter of programming in a low-level language such as C: even in C, the performance difference between naïve and highly optimized image processing code for the same algorithm is often an order of magnitude. Unfortunately, optimization usually comes at a large cost in programmer time and code complexity, as computation must be globally reorganized to efficiently exploit the memory system (locality, e.g., in caches) and many execution units (parallelism, e.g., across threads and vector lanes). Image processing pipelines are both wide and deep: they consist of many data-parallel stages that benefit hugely from parallel execution across pixels, but stages are often memory bandwidth limited--they do little work per load and store.
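The algorithm/schedule separation described above can be sketched in a few lines of plain Python (a conceptual illustration, not Halide itself). The algorithm — a two-stage pipeline — stays fixed, while the schedule decides intermediate storage and evaluation order. Both schedules below compute identical results; only the organization of the computation differs.

```python
def stage1(inp, x):
    return inp[x] * 2          # first pipeline stage (pointwise)

def breadth_first(inp, n):
    # Schedule A: compute stage1 over the whole domain, store it, then
    # compute stage2. Maximal storage; for large images the intermediate
    # falls out of cache, which is the locality problem the text describes.
    tmp = [stage1(inp, x) for x in range(n + 1)]
    return [tmp[x] + tmp[x + 1] for x in range(n)]

def fused(inp, n):
    # Schedule B: inline stage1 into stage2. No intermediate storage and
    # better locality, at the cost of recomputing shared stage1 values.
    return [stage1(inp, x) + stage1(inp, x + 1) for x in range(n)]

inp = list(range(10))
n = len(inp) - 1
assert breadth_first(inp, n) == fused(inp, n)
print(breadth_first(inp, n)[:3])
```

In Halide proper this trade-off is expressed with scheduling directives (such as computing a stage at root versus inlining it) applied to an unchanged algorithm, so exploring the space of organizations means editing the schedule, not rewriting the pipeline.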