Abstract
Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management are time-consuming. Feature stores have recently emerged as a new "DBMS-for-ML" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry, with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted at the point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr, a widely used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.
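For readers unfamiliar with the operation, the following is a minimal sketch of what a point-in-time join computes in general: each observation row is matched with the most recent feature value for the same key whose timestamp does not exceed the observation's timestamp, so that no future information leaks into the features. It is not the paper's optimized method and does not use Feathr's API; the function name `point_in_time_join`, the tuple layouts, and the sample data are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(observations, features):
    """Attach to each (key, ts) observation the latest feature value
    for that key with feature timestamp <= ts, or None if there is none.

    observations: iterable of (key, ts)
    features:     iterable of (key, ts, value)
    Both layouts are assumptions made for this sketch.
    """
    # Group feature rows by key.
    by_key = defaultdict(list)
    for key, ts, value in features:
        by_key[key].append((ts, value))

    # Sort each group by timestamp and split into parallel lists
    # so that the timestamps can be binary-searched.
    timestamps, values = {}, {}
    for key, rows in by_key.items():
        rows.sort(key=lambda row: row[0])
        timestamps[key] = [ts for ts, _ in rows]
        values[key] = [value for _, value in rows]

    joined = []
    for key, ts in observations:
        # bisect_right returns how many feature rows have timestamp <= ts.
        idx = bisect_right(timestamps.get(key, []), ts)
        value = values[key][idx - 1] if idx > 0 else None
        joined.append((key, ts, value))
    return joined


# Hypothetical sample data: integer timestamps, one numeric feature.
observations = [("u1", 5), ("u1", 12), ("u2", 3)]
features = [("u1", 1, 0.2), ("u1", 10, 0.7), ("u2", 4, 0.9)]
result = point_in_time_join(observations, features)
```

In this sample, the observation for `u2` at time 3 receives no feature value because the only feature row for `u2` is recorded later, at time 4. The sketch sorts and binary-searches in memory; a feature store would typically perform the same matching on a distributed engine.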
Original language | English |
---|---|
Pages (from-to) | 4230-4239 |
Number of pages | 10 |
Journal | Proceedings of the VLDB Endowment |
Volume | 16 |
Issue number | 13 |
DOIs | |
Publication status | Published - 2023 |
Event | 49th International Conference on Very Large Data Bases, VLDB 2023 - Vancouver, Canada Duration: 2023 Aug 28 → 2023 Sept 1 |
Bibliographical note
Publisher Copyright: © 2023, VLDB Endowment. All rights reserved.
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- General Computer Science