Just 10 years ago, most application development testing strategies focused on unit tests to validate business logic, manual test cases to certify user experiences, and separate load-testing scripts to confirm performance and scalability. Feature development and release were relatively slow compared with today's capabilities built on cloud infrastructure, microservice architectures, continuous integration and continuous delivery (CI/CD) automation, and continuous testing. Furthermore, many applications today are developed by configuring software as a service (SaaS) or by building low-code and no-code applications, which also require testing of the underlying business flows and processes. Agile development teams in DevOps organizations aim to reduce feature cycle time, increase delivery frequency, and ensure high-quality user experiences. The question is: how can they reduce risk and shift testing left without creating new testing complexities, deployment bottlenecks, security gaps, or significant cost increases?
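The unit-testing layer of such a strategy can be as small as a few assertions run on every commit by the CI/CD pipeline. The `apply_discount` rule below is a hypothetical example of business logic, not drawn from any specific application:

```python
# Hypothetical business rule: orders over $100 get a 10% discount.
def apply_discount(total: float) -> float:
    """Return the order total after any applicable discount."""
    if total < 0:
        raise ValueError("total must be non-negative")
    return round(total * 0.9, 2) if total > 100 else total


# Unit tests like these run automatically on every commit,
# catching regressions in business logic before deployment.
def test_discount_applied_above_threshold():
    assert apply_discount(200.0) == 180.0

def test_no_discount_at_or_below_threshold():
    assert apply_discount(100.0) == 100.0
```

Shifting testing left means tests like these run in minutes at commit time, rather than surfacing defects in a later manual or load-testing phase.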
Just when you thought it couldn't grow any more explosively, the data/AI landscape just did: the rapid pace of company creation, exciting new product and project launches, a deluge of VC financings, unicorn creation, IPOs, etc. It has also been a year of multiple threads and stories intertwining. One story has been the maturation of the ecosystem, with market leaders reaching large scale and ramping up their ambitions for global market domination, in particular through increasingly broad product offerings. Some of those companies, such as Snowflake, have been thriving in public markets (see our MAD Public Company Index), and a number of others (Databricks, Dataiku, DataRobot, etc.) have raised very large (or in the case of Databricks, gigantic) rounds at multi-billion valuations and are knocking on the IPO door (see our Emerging MAD company Index). But at the other end of the spectrum, this year has also seen the rapid emergence of a whole new generation of data and ML startups. Whether they were founded a few years or a few months ago, many experienced a growth spurt in the past year or so. Part of it is due to a rabid VC funding environment and part of it, more fundamentally, is due to inflection points in the market. In the past year, there's been less headline-grabbing discussion of futuristic applications of AI (self-driving vehicles, etc.), and a bit less AI hype as a result. Regardless, data and ML/AI-driven application companies have continued to thrive, particularly those focused on enterprise use cases. Meanwhile, a lot of the action has been happening behind the scenes on the data and ML infrastructure side, with entirely new categories (data observability, reverse ETL, metrics stores, etc.) appearing or drastically accelerating. To keep track of this evolution, this is our eighth annual landscape and "state of the union" of the data and AI ecosystem -- coauthored this year with my FirstMark colleague John Wu.
(For anyone interested, here are the prior versions: 2012, 2014, 2016, 2017, 2018, 2019: Part I and Part II, and 2020.) For those who have remarked over the years how insanely busy the chart is, you'll love our new acronym: Machine learning, Artificial intelligence, and Data (MAD) -- this is now officially the MAD landscape! We've learned over the years that those posts are read by a broad group of people, so we have tried to provide a little bit for everyone -- a macro view that will hopefully be interesting and approachable to most, and then a slightly more granular overview of trends in data infrastructure and ML/AI for people with a deeper familiarity with the industry. Let's start with a high-level view of the market. As the number of companies in the space keeps increasing every year, the inevitable questions are: Why is this happening? How long can it keep going?
Artificial intelligence (AI) is set to change how the world works. Although it's not perfect, artificial intelligence is a game changer and the main engine of the digital revolution. The COVID-19 crisis has accelerated the need for intelligent human-machine digital platforms that foster new knowledge, competences, and workforce skills, spanning advanced cognitive, scientific, technological, engineering, social, and emotional skills. In the AI and robotics era, there is high demand for scientific knowledge, digital competence, and high-technology training across a range of exponential technologies, such as artificial intelligence, machine learning and robotics, data science and big data, cloud and edge computing, the Internet of Things, 5G, cybersecurity, and digital reality.
The emergence of data science as a field of study and practical application in recent decades has led to the development of technologies such as deep learning, natural language processing, and computer vision. Broadly speaking, it has enabled machine learning (ML) as a way of working toward what we refer to as artificial intelligence (AI), a field of technology that is rapidly transforming the way we work and live. Data science encompasses the theoretical and practical application of ideas including big data, predictive analytics, and artificial intelligence. If data is the oil of the information age and ML is the engine, then data science is the digital domain's equivalent of the laws of physics that cause combustion to occur and pistons to move. A key point to remember is that as the importance of working with data grows, the science behind it is becoming more accessible.
A global KPMG survey showed that organizations cannot yet reap the full benefits of data analytics because of data quality issues and a shortage of skilled people. In 30 years' time, developments in data analytics itself could solve this problem, making many current professions in the sector obsolete. The impossible will become possible, and this may well lead to autonomous decision-making processes. Data analytics is expected to radically change the way we live and do business. Even today, we use analytics in our devices to inform many of the decisions in our lives.
The pandemic affected many aspects of both our private and professional lives. Without any doubt, we all faced significant challenges when required to move, almost overnight, to fully remote working. We stopped working from the office, meeting colleagues and clients face to face, and travelling. That sudden shift changed not only the way we work but also our daily activities, both business and private, and significantly reduced our real-world social interactions, affecting our mental well-being as well. And probably not everyone took enough care to maintain a healthy work-life balance and the activities that sustain both physical and mental health.
Bibliographic coupling and co-citation are two analytical methods widely used to measure the degree of similarity between scientific papers. These approaches are intuitive, easy to put into practice, and computationally cheap. Moreover, they have been used to generate maps of science, allowing researchers to visualize how research fields interact. Nonetheless, these methods yield no signal unless two papers share at least one common reference, limiting their usefulness for pairs of papers with no direct connection. In this work, we propose to extend bibliographic coupling to the deep neighborhood by using graph diffusion methods. This approach defines a similarity between any two papers, making it possible to generate a local map of science that highlights field organization.
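The two classical measures are easy to compute from a citation matrix: if A[i, j] = 1 when paper i cites paper j, bibliographic coupling is A·Aᵀ and co-citation is Aᵀ·A. A minimal sketch on a toy corpus (the four-paper citation pattern is invented for illustration; the diffusion step is a generic Katz-style truncation, not the specific method proposed in the work above):

```python
import numpy as np

# Toy citation matrix: A[i, j] = 1 if paper i cites paper j.
# Papers 0 and 1 both cite paper 3; papers 2 and 3 are both cited by paper 0.
A = np.array([
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])

# Bibliographic coupling: entry (i, j) counts references shared by i and j.
coupling = A @ A.T

# Co-citation: entry (i, j) counts papers that cite both i and j.
cocitation = A.T @ A

# A simple diffusion extends similarity beyond directly shared references,
# here a Katz-style series over the coupling graph truncated at k = 2.
beta = 0.5
S = beta * coupling + beta**2 * (coupling @ coupling)

print(coupling[0, 1])    # papers 0 and 1 share one reference -> 1
print(cocitation[2, 3])  # papers 2 and 3 are co-cited by one paper -> 1
```

Note that `coupling[i, j]` is zero whenever papers i and j share no reference, which is exactly the limitation that diffusion over deeper neighborhoods is meant to address.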
Snowflake is a cloud data warehouse provided as software as a service (SaaS). It is built on a unique architecture designed to handle multiple aspects of data and analytics. Snowflake sets itself apart from traditional data warehouse solutions with capabilities such as improved performance, simplicity, high concurrency, and cost-effectiveness. Its shared-data architecture physically separates compute from storage, something traditional offerings, where the two are coupled, cannot do. This streamlines the process for businesses to store and analyze massive volumes of data using cloud-based tools.
Getting the software right is important when developing machine learning models, such as recommendation or classification systems. But at eBay, optimizing the software to run on a particular piece of hardware using distillation and quantization techniques was absolutely essential to ensure scalability. "[I]n order to build a truly global marketplace that is driven by state of the art and powerful and scalable AI services," Kopru said, "you have to do a lot of optimizations after model training, and specifically for the target hardware." With 1.5 billion active listings from more than 19 million active sellers trying to reach 159 million active buyers, the ecommerce giant has a global reach that is matched by only a handful of firms. Machine learning and other AI techniques, such as natural language processing (NLP), play big roles in scaling eBay's operations to reach its massive audience. For instance, automatically generating descriptions of product listings is crucial for displaying information on the small screens of smartphones, Kopru said.
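eBay's specific post-training pipeline isn't detailed here, but the second technique Kopru names, quantization, can be sketched in a few lines. This is a generic symmetric per-tensor int8 scheme, assumed for illustration only: weights are stored as int8 plus one float scale, giving a 4x memory reduction over float32 at a bounded reconstruction error.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)  # stand-in for a weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, and the per-element
# reconstruction error is at most half a quantization step.
max_err = np.abs(w - w_hat).max()
print(w.nbytes / q.nbytes, max_err <= scale / 2 + 1e-6)
```

Production systems typically go further (per-channel scales, calibration data, quantization-aware fine-tuning), but the core trade, precision for memory and hardware-friendly integer arithmetic, is the same.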
A key part of the NLP ethics movement is responsible use of data, but exactly what that means, and how best to achieve it, remains unclear. This position paper discusses the core legal and ethical principles for the collection and sharing of textual data, and the tensions between them. We propose a potential checklist for responsible data (re-)use that could both standardise the peer review of conference submissions and enable a more in-depth view of published research across the community. Our proposal aims to contribute to a consistent standard for data (re-)use, embraced across NLP conferences.