TensorFlow* Optimizations on Modern Intel Architecture

@machinelearnbot

TensorFlow* is a leading deep learning and machine learning framework, which makes it important for Intel and Google to ensure that it extracts maximum performance from Intel's hardware. This paper introduces the Artificial Intelligence (AI) community to the TensorFlow optimizations on Intel Xeon and Intel Xeon Phi processor-based platforms. These optimizations are the fruit of a close collaboration between Intel and Google engineers, announced last year by Intel's Diane Bryant and Google's Diane Greene at the first Intel AI Day. We describe the various performance challenges we encountered during this optimization exercise and the solutions adopted. We also report performance improvements on a sample of common neural network models.
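
The optimizations discussed in the paper are largely exposed through a handful of threading and affinity settings. A minimal sketch of applying them, assuming an MKL-enabled TensorFlow 1.x build; the specific values below are illustrative placeholders, not the paper's tuned numbers:

```python
# Threading/affinity knobs commonly tuned for MKL-enabled TensorFlow.
# The values here are illustrative assumptions; tune per platform.
import os

os.environ["KMP_BLOCKTIME"] = "0"    # OpenMP threads sleep right after finishing work
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin threads to cores
os.environ["OMP_NUM_THREADS"] = "44" # roughly the number of physical cores

import tensorflow as tf  # TF 1.x-era API

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # independent ops run concurrently
    intra_op_parallelism_threads=44,  # threads within a single op
)
with tf.Session(config=config) as sess:
    pass  # build and run the graph under this configuration
```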


New Optimizations Improve Deep Learning Frameworks For CPUs

#artificialintelligence

Since most of us need more than a "machine learning only" server, I'll focus on the reality of how Intel Xeon SP Platinum processors remain the best choice for servers, including servers that need to do machine learning as part of their workload. Here is a partial rundown of key software that accelerates deep learning on Intel Xeon Platinum processors enough that the best-case performance advantage of GPUs is closer to 2X than to 100X. There is also a good article in Parallel Universe Magazine, Issue 28, starting on page 26, titled "Solving Real-World Machine Learning Problems with Intel Data Analytics Acceleration Library." High-core-count CPUs (the Intel Xeon Phi processors, in particular the upcoming "Knights Mill" version) and FPGAs (Intel Xeon processors coupled with Intel/Altera FPGAs) offer highly flexible options with excellent price/performance and power efficiency.
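
For the Intel DAAL piece specifically, a minimal sketch of what the library looks like from Python, assuming the daal4py bindings (the magazine article may use a different interface):

```python
# K-means via Intel DAAL's daal4py bindings -- a small sketch, not the
# magazine article's example. Assumes daal4py is installed.
import daal4py as d4p
import numpy as np

data = np.random.rand(1000, 4)  # toy dataset: 1000 samples, 4 features

# Choose initial centroids with k-means++ using the CPU-optimized kernel.
init = d4p.kmeans_init(nClusters=3, method="plusPlusDense")
centroids = init.compute(data).centroids

# Run the clustering itself; DAAL vectorizes this for Xeon-class CPUs.
result = d4p.kmeans(nClusters=3, maxIterations=50).compute(data, centroids)
print(result.centroids)
```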


AWS AI Blog

#artificialintelligence

Second, framework developers need to maintain multiple backends to guarantee performance on hardware ranging from smartphone chips to data center GPUs. Diverse AI frameworks and hardware bring huge benefits to users, but it is very challenging for AI developers to deliver consistent results to end users. Motivated by compiler technology, a group of researchers including Tianqi Chen, Thierry Moreau, Haichen Shen, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy from the Paul G. Allen School of Computer Science & Engineering, University of Washington, together with Ziheng Jiang from the AWS AI team, introduced the TVM stack to simplify this problem. Today, AWS is excited to announce, together with the research team from UW, an end-to-end compiler based on the TVM stack that compiles workloads directly from various deep learning frontends into optimized machine code.
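
In practice the stack is driven from Python: a frontend imports a model into the graph IR, TVM compiles it for a target, and a lightweight runtime executes it. A minimal sketch using the NNVM-era API from around the announcement (entry points have since moved, e.g. to tvm.relay, so treat the exact calls as version-dependent):

```python
# Compile and run a tiny graph with the TVM stack (NNVM-era API sketch).
import numpy as np
import nnvm.compiler
import nnvm.symbol as sym
import tvm
from tvm.contrib import graph_runtime

# Define a toy computation in the graph IR: y = relu(x + 1).
x = sym.Variable("x")
y = sym.relu(x + 1.0)

# Compile the graph to optimized machine code for the local CPU via LLVM;
# target="cuda", "opencl", etc. retargets the same graph to other hardware.
shape = {"x": (1, 8)}
graph, lib, params = nnvm.compiler.build(y, target="llvm", shape=shape)

# Execute with the lightweight graph runtime.
m = graph_runtime.create(graph, lib, tvm.cpu())
m.set_input("x", np.random.randn(1, 8).astype("float32"))
m.run()
print(m.get_output(0).asnumpy())
```

A real workload would come in through one of the frontends (MXNet, ONNX, CoreML, and so on) rather than being built by hand as above.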


A load balancer that learns, WebTorch – UnifyID – Medium

#artificialintelligence

In my previous blog post, "How I stopped worrying and embraced docker microservices," I talked about why microservices are the bee's knees for scaling Machine Learning in production. If only there were a tool that made this decision easy and allowed you to go even to the extreme case of writing a monolith, without sacrificing either HTTP performance (and pretty HTTP server semantics) or ML performance and relevance in the rapidly growing Deep Learning market. WebTorch is the freak child of the fastest, most stable HTTP server, nginx, and the fastest, most relevant Deep Learning framework, Torch. Of course, that doesn't mean WebTorch is either the best-performing HTTP server or the best-performing Deep Learning framework, but it's at least worth a look, right?


5 Reasons Why Your Data Science Team Needs The DGX Station

#artificialintelligence

I immediately pulled a container and started work on a CNTK NCCL project; the next day I pulled another container to work on a TF biomedical project. By running Nvidia Optix 5.0 on a DGX Station, content creators can significantly accelerate training, inference, and rendering (meaning both AI and graphics tasks). The DGX Station gives you the flexibility to do AI work at the desk, in the data center, or at the edge, and Nvidia pitches it as the fastest personal supercomputer for researchers and data scientists (www.nvidia.com/dgx-station). As one team put it, "for our current projects we need a compute server that we have exclusive access to."


Search for the fastest Deep Learning Framework supported by Keras

@machinelearnbot

Currently the official Keras release already supports Google's TensorFlow and Microsoft's CNTK deep learning libraries, besides other popular libraries like Theano. Keras also enables developers to quickly test relative performance across multiple supported deep learning frameworks. MXNet is the exception: it doesn't yet support newer Keras functions, and scripts would have needed significant changes before running on MXNet. In a standard deep neural network test using the MNIST dataset, CNTK, TensorFlow, and Theano achieve similar scores (2.5–2.7 s/epoch), but MXNet blows them out of the water with 1.4 s/epoch.
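
Reproducing that kind of comparison is mostly a matter of switching the KERAS_BACKEND environment variable and timing fit(). A minimal sketch with a stock MNIST MLP; the model and epoch count are illustrative, not the article's exact benchmark:

```python
# Time seconds/epoch for whichever backend Keras is configured to use.
import os, time
os.environ.setdefault("KERAS_BACKEND", "tensorflow")  # or "theano", "cntk"

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype("float32") / 255
y_train = to_categorical(y_train, 10)

model = Sequential([
    Dense(512, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

epochs = 5
start = time.time()
model.fit(x_train, y_train, batch_size=128, epochs=epochs, verbose=0)
print("s/epoch:", (time.time() - start) / epochs)
```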


Get Started with AI

#artificialintelligence

Rely on the Intel Nervana AI Academy to help you increase your knowledge base and put machine learning to use quickly, efficiently, and cost-effectively on Intel architecture. In this webinar, we continue our exploration of deep learning topics, including the multilayer perceptron, convolutional neural networks, recurrent neural networks, cost functions, and backpropagation. See practical examples and discover new opportunities to apply AI in the real world.
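
As a taste of the backpropagation material, here is a toy two-layer perceptron trained on XOR in plain NumPy; this is a minimal illustration of the forward/backward pass, not the webinar's own code:

```python
# A two-layer perceptron with sigmoid activations and a mean-squared-error
# cost, trained on XOR by backpropagation and gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # backward pass: chain rule applied layer by layer
    dp = (p - y) * p * (1 - p)
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = dp @ W2.T * h * (1 - h)
    dW1, db1 = X.T @ dh, dh.sum(0)
    # gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p.ravel(), 2))  # should approach [0, 1, 1, 0]
```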


PyTorch or TensorFlow?

@machinelearnbot

PyTorch is essentially a GPU-enabled drop-in replacement for NumPy, equipped with higher-level functionality for building and training deep neural networks. In PyTorch the graph construction is dynamic, meaning the graph is built at run time. TensorFlow does have the dynamic_rnn for the more common constructs, but creating custom dynamic computations is more difficult. I haven't found the tools for data loading in TensorFlow (readers, queues, queue runners, etc.)
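
The dynamic-graph point is easy to see in a few lines: ordinary Python control flow determines the graph each time the code runs, and autograd differentiates whatever actually executed. A minimal sketch (the loop bound is arbitrary):

```python
# Define-by-run in PyTorch: the number of loop iterations depends on the
# data, so the computation graph differs from run to run.
import torch

x = torch.randn(3, requires_grad=True)
h = x
while h.norm() < 10:      # data-dependent loop, rebuilt on every call
    h = h * 2
loss = h.sum()
loss.backward()           # autograd traces whatever actually ran
print(x.grad)
```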


Introducing Social Hash Partitioner, a scalable distributed hypergraph partitioner

#artificialintelligence

Because a single host has limited storage and compute resources, our storage systems shard data items over multiple hosts, and our batch jobs execute over clusters of thousands of workers to scale and speed up the computation. Our VLDB'17 paper, "Social Hash Partitioner: A Scalable Distributed Hypergraph Partitioner," describes a new method for partitioning bipartite graphs while minimizing fan-out. We describe the resulting framework as the Social Hash Partitioner (SHP) because it can be used as the hypergraph partitioning component of the Social Hash framework introduced in our earlier NSDI'16 paper. The fan-out reduction model is applicable to many infrastructure optimization problems at Facebook, like data sharding, query routing, and index compression.
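
To make the objective concrete, here is a toy greedy partitioner over a tiny bipartite query/data graph; this only illustrates the fan-out metric, not the SHP algorithm itself, which also enforces balanced buckets and runs distributed:

```python
# Fan-out of an assignment: total distinct buckets each query must contact.
queries = [("q1", {"a", "b"}), ("q2", {"b", "c"}), ("q3", {"c", "d"})]
n_buckets = 2

def fanout(assign):
    return sum(len({assign[i] for i in items}) for _, items in queries)

# Start from a naive round-robin placement of data items into buckets.
items = sorted({i for _, s in queries for i in s})
assign = {item: k % n_buckets for k, item in enumerate(items)}

# Greedy local moves: relocate an item whenever it lowers total fan-out.
# (No balance constraint here, so items may all collapse into one bucket.)
improved = True
while improved:
    improved = False
    for item in items:
        best_bucket, best_cost = assign[item], fanout(assign)
        for b in range(n_buckets):
            assign[item] = b
            if fanout(assign) < best_cost:
                best_bucket, best_cost = b, fanout(assign)
                improved = True
        assign[item] = best_bucket

print(assign, "fan-out:", fanout(assign))
```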