Training a recommender model of 100 trillions parameters on Google Cloud


Recommender systems are an important component of Internet services today: businesses with billions of dollars in revenue are directly driven by recommendation services at big tech companies. The current landscape of production recommender systems is dominated by deep learning based approaches, in which an embedding layer first maps extremely large-scale ID-type features to fixed-length embedding vectors, and complex neural network architectures then use those embeddings to generate recommendations. The continuing advancement of recommender models is often driven by increasing model size: several models have previously been released with billions of parameters, and very recently with up to trillions. Every jump in model capacity has brought a significant improvement in quality. The era of 100 trillion parameters is just around the corner.
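
As a rough illustration of the embedding step described above, the following minimal sketch maps raw ID features to one fixed-length vector. It is not the implementation from the article: the modulo hashing into a fixed number of buckets, the mean pooling, and the names `make_embedding_table` and `embed_ids` are all assumptions made for the example.

```python
import random


def make_embedding_table(num_buckets, dim, seed=0):
    """Create a small random embedding table (hypothetical sizes)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_buckets)]


def embed_ids(table, ids):
    """Look up one row per ID and mean-pool them into a single vector.

    Assumption: IDs are folded into the table with a modulo hash, so
    arbitrarily large ID spaces share a fixed number of rows.
    """
    num_buckets = len(table)
    dim = len(table[0])
    rows = [table[i % num_buckets] for i in ids]
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


table = make_embedding_table(num_buckets=1000, dim=8)
# Three raw IDs of very different magnitude become one 8-dimensional vector.
vec = embed_ids(table, [12_345_678_901, 42, 987_654_321])
```

In a real model the table rows would be trained parameters, which is where most of the parameter count of a large recommender comes from.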

Computer Vision Pipeline with Kubernetes


We produce a multitude of attributes (characteristics attached to an entity such as a building or a parcel) using various sources such as aerial imagery. The idea is to build deep learning models from a few thousand buildings, using labels tagged in-house or existing labels from open data. In a second step, the models are deployed on the whole French territory, which represents more than 35 million images to process (i.e. 4 TB of data). This second step is the focus of this post. The challenge is to run inference at low cost and in a short amount of time (less than a day).
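
The figures above imply a minimum sustained throughput, which a short back-of-the-envelope calculation makes concrete. The sketch below uses only the numbers quoted in the text; the per-worker rate of 20 images per second is a made-up value for illustration, and 4 TB is assumed to mean 4 x 10^12 bytes.

```python
import math

IMAGES = 35_000_000            # "more than 35 million images" (lower bound)
DATA_BYTES = 4 * 10**12        # 4 TB, assuming decimal terabytes
DEADLINE_SECONDS = 24 * 60 * 60  # "less than a day"


def required_throughput(images, seconds):
    """Images per second needed to finish within the deadline."""
    return images / seconds


def workers_needed(images, seconds, per_worker_rate):
    """Number of parallel workers, given a per-worker rate in images/s."""
    return math.ceil(images / (seconds * per_worker_rate))


rate = required_throughput(IMAGES, DEADLINE_SECONDS)
avg_image_bytes = DATA_BYTES / IMAGES
# Hypothetical: each inference worker sustains 20 images per second.
workers = workers_needed(IMAGES, DEADLINE_SECONDS, per_worker_rate=20)
```

With these inputs the pipeline has to sustain a little over 405 images per second, at an average of roughly 114 KB per image, before any allowance for retries or startup time.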

Jetson Mate: A Compact Carrier Board for Jetson Nano/NX System-on-Modules


Containers have become the unit of deployment not just for data center and cloud workloads but also for edge applications. Along with containers, Kubernetes has become the foundation of the infrastructure. Distributions such as K3s are fueling the adoption of Kubernetes at the edge. I have seen many challenges when working with large retailers and system integrators rolling out Kubernetes-based edge infrastructure. One of them is the ability to mix and match ARM64 and AMD64 devices to run AI workloads.

Deploy machine learning models on Google Cloud AI Platform


This course is meant for anyone who already knows how to build machine learning and deep learning models and is interested in deploying them easily on Google Cloud AI Platform, so that they can send POST requests to the deployed models. You should also be familiar with Natural Language Processing and some basic cloud concepts. I will explain everything in the videos. Most importantly, you do not need to be an expert in Python to do this.

TensorFlow and the Google Cloud ML Engine for Deep Learning


TensorFlow is quickly becoming the technology of choice for deep learning, because of how easy TF makes it to build powerful and sophisticated neural networks. The Google Cloud Platform is a great place to run TF models at scale and to perform distributed training and prediction. This is a comprehensive, from-the-basics course on TensorFlow and building neural networks. It assumes no prior knowledge of TensorFlow; all you need to know is basic Python programming.

KServe: A Robust and Extensible Cloud Native Model Server


If you are familiar with Kubeflow, you know KFServing as the platform's model server and inference engine. In September last year, the KFServing project went through a transformation to become KServe. Apart from the name change, KServe is now an independent component that has graduated from the Kubeflow project. The separation allows KServe to evolve as a separate, cloud native inference engine deployed as a standalone model server. Of course, it will continue to have tight integration with Kubeflow, but the two will be treated and maintained as independent open source projects.

Maple Leaf Sports & Entertainment Selects AWS as Cloud Provider


… provider and provider of artificial intelligence (AI), machine learning (ML), and deep learning cloud services, according to AWS last month.

PyTorch on Google Cloud: Blog series recap


PyTorch is an open source machine learning framework, primarily developed by Meta (previously Facebook). PyTorch is extensively used in the research space, and in recent years it has gained immense traction in industry due to its ease of use and deployment. Vertex AI, a fully managed end-to-end data science and machine learning platform on Google Cloud, has first-class support for PyTorch, making it optimized, compatibility tested and ready to deploy. We started a new blog series - PyTorch on Google Cloud - to uncover, demonstrate and share how to build, train and deploy PyTorch models at scale on Cloud AI Infrastructure using GPUs and TPUs on Vertex AI, and how to create reproducible machine learning pipelines on Google Cloud. This blog post is the home page for the series, with links to the existing and upcoming posts for readers to refer to.

Reinforcement Learning-Empowered Mobile Edge Computing for 6G Edge Intelligence

Mobile edge computing (MEC) is considered a novel paradigm for computation-intensive and delay-sensitive tasks in fifth generation (5G) networks and beyond. However, its uncertainty, that is, the dynamics and randomness arising from the mobile device, wireless channel, and edge network sides, results in high-dimensional, nonconvex, nonlinear, and NP-hard optimization problems. With reinforcement learning (RL), an agent trained by iteratively interacting with the dynamic and random environment can intelligently obtain the optimal policy in MEC. Furthermore, evolved versions of RL, such as deep RL (DRL), can achieve higher convergence speed and learning accuracy based on parametric approximation of the large-scale state-action space. This paper provides a comprehensive research review on RL-enabled MEC and offers insight for development in this area. More importantly, the MEC challenges associated with free mobility, dynamic channels, and distributed services that can be solved by different kinds of RL algorithms are identified, followed by how they can be solved by RL solutions in diverse mobile applications. Finally, the open challenges are discussed to provide helpful guidance for future research in RL training and learning MEC.
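
To make the idea of an agent learning an MEC policy by interaction more concrete, here is a toy tabular Q-learning sketch for a single offloading decision. It is not taken from the paper: the two-state channel model, the delay costs in `REWARD`, and all hyperparameters are invented for illustration, and the channel is assumed to switch randomly regardless of the action taken.

```python
import random

LOCAL, OFFLOAD = 0, 1   # actions: compute on device or offload to the edge
BAD, GOOD = 0, 1        # states: wireless channel quality

# Hypothetical rewards (negative task delay) for each state-action pair.
REWARD = {
    (BAD, LOCAL): -2.0,
    (BAD, OFFLOAD): -4.0,   # offloading over a bad channel is slow
    (GOOD, LOCAL): -2.0,
    (GOOD, OFFLOAD): -1.0,  # offloading over a good channel is fastest
}


def train(steps=20000, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    state = rng.choice((BAD, GOOD))
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice((LOCAL, OFFLOAD))  # explore
        else:
            action = max((LOCAL, OFFLOAD), key=lambda a: q[state][a])
        reward = REWARD[(state, action)]
        next_state = rng.choice((BAD, GOOD))  # random channel (assumption)
        # Standard Q-learning update toward the bootstrapped target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Best action per state according to the learned Q-table."""
    return [max((LOCAL, OFFLOAD), key=lambda a: row[a]) for row in q]


q_table = train()
policy = greedy_policy(q_table)
```

In this toy setting the learned policy computes locally when the channel is bad and offloads when it is good. The DRL methods mentioned above replace the Q-table with a parametric function approximator so that much larger state-action spaces become tractable.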

GitHub - deepmind/xmanager


XManager is a platform for packaging, running and keeping track of machine learning experiments. It currently enables one to launch experiments locally or on Google Cloud Platform (GCP). Interaction with experiments is done via XManager's APIs through Python launch scripts. To get started, install XManager and, if needed, its prerequisites, then follow the tutorial or codelab.ipynb to create and run a launch script. Alternatively, a PyPI project is also available.