
Matei Zaharia
Researcher, co-founder at Databricks
USA
Data / AI Infra. co-founder at Databricks.
50 papers found
Identification of cardiac wall motion abnormalities in diverse populations by deep learning of the electrocardiogram
npj Digital Medicine, 2025, 5 citations
ColBERT-Serve: Efficient Multi-stage Memory-Mapped Scoring
Lecture Notes in Computer Science, 2025, 1 citation
Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS
Proceedings of the VLDB Endowment, 2025, 1 citation
Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
arXiv (Cornell University), 2024, 5 citations
RAFT: Adapting Language Model to Domain Specific RAG
arXiv (Cornell University), 2024, 26 citations
Long Context RAG Performance of Large Language Models
arXiv (Cornell University), 2024, 5 citations
Adaptive and Robust Query Execution for Lakehouses at Scale
Proceedings of the VLDB Endowment, 2024, 9 citations
Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
arXiv (Cornell University), 2024, 3 citations
Semantic Operators: A Declarative Model for Rich, AI-based Data Processing
arXiv (Cornell University), 2024, 9 citations
Optimizing LLM Queries in Relational Data Analytics Workloads
arXiv (Cornell University), 2024, 6 citations
Image and data mining in reticular chemistry powered by GPT-4V
Digital Discovery, 2024, 50 citations
How Is ChatGPT’s Behavior Changing Over Time?
Harvard Data Science Review, 2024, 245 citations
ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
arXiv (Cornell University), 2024, 3 citations
Text2SQL is Not Enough: Unifying AI and Databases with TAG
arXiv (Cornell University), 2024, 3 citations
World Model on Million-Length Video And Language With Blockwise RingAttention
arXiv (Cornell University), 2024, 11 citations
Specifications: The missing link to making the development of LLM systems an engineering discipline
arXiv (Cornell University), 2024, 2 citations
Data Management for ML-Based Analytics and Beyond
ACM / IMS Journal of Data Science, 2024, 3 citations
Drowning in Documents: Consequences of Scaling Reranker Inference
arXiv (Cornell University), 2024
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
arXiv (Cornell University), 2024, 1 citation
ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
Proceedings of the ACM on Management of Data, 2024, 28 citations
Epoxy: ACID Transactions across Diverse Data Stores
Proceedings of the VLDB Endowment, 2023, 15 citations
Accelerating Aggregation Queries on Unstructured Streams of Data
Proceedings of the VLDB Endowment, 2023, 5 citations
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
arXiv (Cornell University), 2023, 4 citations
Exploration with Principles for Diverse AI Supervision
arXiv (Cornell University), 2023
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
arXiv (Cornell University), 2023, 49 citations
Data Acquisition: A New Frontier in Data-centric AI
arXiv (Cornell University), 2023, 3 citations
Zelda: Video Analytics using Vision-Language Models
arXiv (Cornell University), 2023, 4 citations
How is ChatGPT's behavior changing over time?
arXiv (Cornell University), 2023, 162 citations
HAPI Explorer: Comprehension, Discovery, and Explanation on History of ML APIs
Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 1 citation
Ring Attention with Blockwise Transformers for Near-Infinite Context
arXiv (Cornell University), 2023, 11 citations
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
arXiv (Cornell University), 2023, 46 citations
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines
arXiv (Cornell University), 2023, 2 citations
R³: Record-Replay-Retroaction for Database-Backed Applications
Proceedings of the VLDB Endowment, 2023, 8 citations