Optimizing Data Pipelines for Machine Learning in Feature Stores

Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez

Research output: Contribution to journal › Conference article › peer-review

1 Citation (Scopus)

Abstract

Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management are time-consuming. Feature stores have recently emerged as a new “DBMS-for-ML” with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry, with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr, a widely used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.
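
For readers unfamiliar with the operation the abstract centers on, the following is a minimal sketch of what a point-in-time join computes: each observation is matched with the most recent feature value for the same key whose timestamp does not exceed the observation's timestamp, so that no future information leaks into training data. This is a naive illustration only, not the paper's optimized method or Feathr's API; the function name `point_in_time_join`, the tuple layouts, and the integer timestamps are assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(observations, feature_rows):
    """Naive point-in-time join (illustrative; names are hypothetical).

    observations: iterable of (key, ts)
    feature_rows: iterable of (key, ts, value)
    Returns a list of (key, ts, value), where value is the latest feature
    value for that key with feature ts <= observation ts, or None.
    """
    # Group feature rows by key.
    by_key = defaultdict(list)
    for key, ts, value in feature_rows:
        by_key[key].append((ts, value))

    # Sort each group by timestamp and split into parallel lists
    # so that timestamps can be binary-searched.
    index = {}
    for key, rows in by_key.items():
        rows.sort(key=lambda row: row[0])
        index[key] = ([ts for ts, _ in rows], [value for _, value in rows])

    joined = []
    for key, ts in observations:
        timestamps, values = index.get(key, ([], []))
        # Number of feature rows with timestamp <= ts.
        pos = bisect_right(timestamps, ts)
        joined.append((key, ts, values[pos - 1] if pos else None))
    return joined
```

An observation for a key at time 5 would pick up a feature recorded at time 4 but ignore one recorded at time 6; an observation that precedes every feature row for its key gets `None`.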

Original language: English
Pages (from-to): 4230-4239
Number of pages: 10
Journal: Proceedings of the VLDB Endowment
Volume: 16
Issue number: 13
Publication status: Published - 2023
Event: 49th International Conference on Very Large Data Bases, VLDB 2023 - Vancouver, Canada
Duration: 2023 Aug 28 to 2023 Sept 1

Bibliographical note

Publisher Copyright:
© 2023, VLDB Endowment. All rights reserved.

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • General Computer Science
