Simplifying Data And AI With a Data Lakehouse
Why do so many organizations find it difficult to leverage the power of data analytics and AI? According to Matei Zaharia, the cofounder and chief technologist at Databricks, the reason is not that data-related problems are intrinsically hard, but that the technology infrastructure that businesses build to manage their data is often more complicated than it needs to be. For the uninitiated, Zaharia started the Apache Spark project during his PhD at UC Berkeley in 2009 before founding Databricks, and today is also an assistant professor of Computer Science at Stanford. He was in town for the STACK conference in November to share his insights about the future of data and the role of the data lakehouse. To illustrate the benefits of data, Zaharia began with a diagram plotting data and AI maturity against the competitive advantage that businesses can expect to gain.
Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers
Tam, Weng Lam, Liu, Xiao, Ji, Kaixuan, Xue, Lilong, Zhang, Xingjian, Dong, Yuxiao, Liu, Jiahua, Hu, Maodi, Tang, Jie
Prompt tuning updates only a small number of task-specific parameters in pre-trained models, and has achieved performance comparable to fine-tuning the full parameter set on both language understanding and generation tasks. In this work, we study the problem of prompt tuning for neural text retrievers. We introduce parameter-efficient prompt tuning for text retrieval across in-domain, cross-domain, and cross-topic settings. Through an extensive analysis, we show that the strategy can mitigate the two issues -- parameter-inefficiency and weak generalizability -- faced by fine-tuning-based retrieval methods. Notably, it can significantly improve the out-of-domain zero-shot generalization of retrieval models. By updating only 0.1% of the model parameters, the prompt tuning strategy can help retrieval models achieve better generalization performance than traditional methods in which all parameters are updated. Finally, to facilitate research on retrievers' cross-topic generalizability, we curate and release an academic retrieval dataset with 18K query-result pairs in 87 topics, making it the largest topic-specific dataset to date.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (11 more...)
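The abstract above describes updating only about 0.1% of the model parameters by training a short sequence of soft prompt embeddings while the pre-trained encoder stays frozen. A minimal sketch of that parameter accounting, with hypothetical dimensions (the hidden size, prompt length, and frozen parameter count below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 768                 # hidden size of the (frozen) retriever
n_prompt_tokens = 16          # length of the learned soft prompt
frozen_params = 110_000_000   # rough size of a BERT-base-scale encoder

# The only trainable parameters: one embedding vector per prompt token.
soft_prompt = rng.normal(scale=0.02, size=(n_prompt_tokens, d_model))

def prepend_prompt(token_embeddings):
    """Prepend the trainable soft prompt to a query's token embeddings
    before the frozen encoder consumes them."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

trainable = soft_prompt.size  # 16 * 768 = 12,288 trainable parameters
print(f"trainable fraction: {trainable / frozen_params:.5%}")
```

During training, gradients flow only into `soft_prompt`; the encoder's weights never change, which is what makes the method parameter-efficient.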
Accidental Billionaires: How Seven Academics Who Didn't Want To Make A Cent Are Now Worth Billions
Inside a 13th-floor boardroom in downtown San Francisco, the atmosphere was tense. It was November 2015, and Databricks, a two-year-old software company started by a group of seven Berkeley researchers, was long on buzz but short on revenue. The directors awkwardly broached subjects that had been rehashed time and again. The startup had been trying to raise funds for five months, but venture capitalists were keeping it at arm's length, wary of its paltry sales. Seeing no other option, NEA partner Pete Sonsini, an existing investor, raised his hand to save the company with an emergency $30 million injection. Founding CEO Ion Stoica had agreed to step aside and return to his professorship at the University of California, Berkeley. The obvious move was to bring in a seasoned Silicon Valley executive, which is exactly what Databricks' chief competitor Snowflake did twice on its way to a software-record $33 billion IPO in September 2020.
- North America > United States > California > San Francisco County > San Francisco (0.25)
- North America > United States > California > Alameda County > Berkeley (0.24)
Visiting the SOSP 2019 AI System Workshop
The ACM Symposium on Operating Systems Principles (SOSP) has a long history and a great reputation in Operating Systems (OS) research. This year, SOSP was held in Huntsville, a charming town in lake country some 200 km north of Toronto. On a rainy Sunday, Synced visited Huntsville to check out the SOSP AI System Workshop. The growing and widespread deployment of AI has motivated OS researchers to develop novel systems engineering for AI, and the SOSP AI System Workshop explored these efforts to advance research at the intersection of AI and operating systems.
- North America > Canada > Ontario > Toronto (0.25)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > China (0.04)
MLflow Opens Up to R
Data scientists who work in the R environment can now use MLflow, the open source project that Databricks released earlier this year to help manage the workflows associated with machine learning development and production lifecycles. In June, Databricks co-founder and CTO Matei Zaharia unveiled MLflow as a way to automate much of the work that data scientists do when building, testing, and deploying machine learning models. The open source software was designed to fill in the gaps between the various tools, frameworks, and processes used to build machine learning systems, including tracking code, packaging models, and deploying them into production. According to Databricks, MLflow lets users package their code as reproducible runs and execute and compare hundreds of parallel experiments on any hardware or software platform, including on-premises and cloud-based environments. Assistance with hyperparameter tuning is also provided.
- North America > United States > Massachusetts > Suffolk County > Boston (0.06)
- Europe (0.06)
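The run-tracking-and-comparison workflow described above can be sketched in plain Python. This is a conceptual illustration of the pattern, not MLflow's actual API; the `Run` class and `best_run` helper are hypothetical names invented for the sketch:

```python
# Each run records its parameters and metrics so that hundreds of
# parallel experiments can later be compared on equal footing.
class Run:
    def __init__(self, run_id):
        self.run_id = run_id
        self.params = {}
        self.metrics = {}

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics[key] = value

def best_run(runs, metric):
    """Compare finished runs, e.g. from a hyperparameter sweep."""
    return max(runs, key=lambda r: r.metrics[metric])

# A toy hyperparameter sweep: three runs with different learning rates.
runs = []
for i, lr in enumerate([0.1, 0.01, 0.001]):
    run = Run(run_id=str(i))
    run.log_param("learning_rate", lr)
    run.log_metric("accuracy", 0.9 - abs(lr - 0.01))  # toy score
    runs.append(run)

print(best_run(runs, "accuracy").params)
```

A real tracker additionally persists artifacts and the software environment of each run, which is what makes a run reproducible rather than merely logged.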
Databricks Open Sources MLflow to Simplify Machine Learning Lifecycle
Databricks today unveiled MLflow, a new open source project that aims to bring some standardization to the complex processes that data scientists oversee during the course of building, testing, and deploying machine learning models. "Everybody who has done machine learning knows that the machine learning development lifecycle is very complex," Apache Spark creator and Databricks CTO Matei Zaharia said during his keynote address at Databricks' Spark and AI Summit in San Francisco. "There are a lot of issues that come up that you don't have in normal software development lifecycle." The vast volumes of data, together with the abundance of machine learning frameworks, the large scale of production systems, and the distributed nature of data science and engineering teams, combine to create a huge number of variables to control in the machine learning DevOps lifecycle, and that is before tuning even begins. "They have all these tuning parameters that you have to change and explore to get a good model," Zaharia said.
Meet Ray, the Real-Time Machine-Learning Replacement for Spark
Researchers at UC Berkeley's RISELab have developed a new distributed framework designed to let Python-based machine learning and deep learning workloads execute in real time with MPI-like power and granularity. Called Ray, the framework is ostensibly a replacement for Spark, which is seen as too slow for some real-world AI applications, and should be ready for production use in less than a year. Ray is one of the first technologies to emerge from RISELab, the research group at Berkeley that followed the highly successful AMPLab, which generated a host of compelling distributed technologies that have impacted high-performance and enterprise computing alike, including Spark, Mesos, Tachyon, and others. One of the advisors for the old AMPLab and the current RISELab, Computer Science Professor Michael Jordan, discussed the core principles and drivers behind Ray during the recent Strata Hadoop World conference in San Jose, California. "Spark was developed because my students were complaining about Hadoop," Jordan said during a keynote address on March 16.
- Asia > Middle East > Jordan (0.34)
- North America > United States > California > Santa Clara County > San Jose (0.25)
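The fine-grained task parallelism described above, where many small Python functions run concurrently and their results are collected as futures, can be illustrated with the standard library. This is a conceptual sketch, not Ray's API (Ray's real interface centers on the `@ray.remote` decorator and `ray.get`), and it runs in one process rather than across a cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def score(x):
    """A tiny stateless task, e.g. one simulation step or model rollout."""
    return x * x

# Submit many small tasks and gather their results via futures; Ray
# applies the same pattern across a whole cluster instead of a single
# process's thread pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(score, i) for i in range(8)]
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The "MPI-like granularity" claim is about exactly this: scheduling very small units of work cheaply enough that latency-sensitive AI workloads stay responsive.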