This paper studies data optimization for Learning to Rank (LtR): dropping training labels to increase ranking accuracy. Our work is inspired by data dropout, which shows that some training data do not positively influence learning and are better dropped out, despite the common belief that a larger training dataset is beneficial. Our main contribution is to extend this intuition to noisy- and semi-supervised LtR scenarios: some human annotations can be noisy or out-of-date, as can machine-generated pseudo-labels in semi-supervised scenarios. Dropping out such unreliable labels would benefit both scenarios. State-of-the-art methods employ the Influence Function (IF) to estimate how each training instance affects learning, and we identify and overcome two challenges specific to LtR. 1) Non-convex ranking functions violate the assumptions required for the robustness of IF estimation. 2) The pairwise learning of LtR incurs quadratic estimation overhead. Our technical contributions address these challenges: first, we revise estimation and data optimization to accommodate reduced reliability; second, we devise a group-wise estimation, reducing cost while keeping accuracy high. We validate the effectiveness of our approach on a wide range of ad-hoc information retrieval benchmarks and real-life search engine datasets in both noisy- and semi-supervised scenarios.
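To illustrate the IF-based data dropout idea the abstract builds on, the following is a minimal sketch (not the paper's method): for a convex stand-in model (L2-regularised logistic regression, in place of a ranking function), the influence of removing training point i on validation loss is approximated as (1/n)·g_val^T H^{-1} g_i, and points whose removal is predicted to reduce validation loss are flagged as unreliable. All names, sizes, and the synthetic noisy-label setup are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of Influence-Function (IF) scoring for dropping noisy
# labels, using logistic regression as a convex stand-in for a ranking model.
rng = np.random.default_rng(0)
n, d, n_flip, lam = 200, 5, 20, 1e-2          # hypothetical sizes

X = rng.normal(size=(n, d))
w_true = np.ones(d)
y = np.sign(X @ w_true)
y[:n_flip] *= -1.0                             # simulate noisy (flipped) labels

X_val = rng.normal(size=(100, d))              # clean validation set
y_val = np.sign(X_val @ w_true)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Fit the model by plain gradient descent on the regularised logistic loss.
theta = np.zeros(d)
for _ in range(2000):
    m = y * (X @ theta)
    grad = -(X * (y * sigmoid(-m))[:, None]).mean(0) + lam * theta
    theta -= 0.5 * grad

# Per-example training-loss gradients G and the regularised Hessian H.
m = y * (X @ theta)
G = -(X * (y * sigmoid(-m))[:, None])          # shape (n, d)
s = sigmoid(X @ theta)
H = (X.T * (s * (1 - s))) @ X / n + lam * np.eye(d)

# Mean validation-loss gradient.
m_val = y_val * (X_val @ theta)
g_val = -(X_val * (y_val * sigmoid(-m_val))[:, None]).mean(0)

# IF estimate: removing point i changes validation loss by ~ (1/n) g_val^T H^{-1} g_i.
# A positive "harmfulness" score means dropping the point should reduce val loss.
harmfulness = -(G @ np.linalg.solve(H, g_val))
drop = np.argsort(harmfulness)[-n_flip:]       # candidate labels to drop
```

Note that this sketches only the basic IF recipe; the paper's contributions (handling non-convex rankers and avoiding the quadratic pairwise cost via group-wise estimation) are precisely what this naive version lacks.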
|Title of host publication||ICTIR 2020 - Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval|
|Publisher||Association for Computing Machinery|
|Number of pages||8|
|Publication status||Published - 2020 Sept 14|
|Event||6th ACM SIGIR / 10th International Conference on the Theory of Information Retrieval, ICTIR 2020 - Virtual, Online, Norway|
Duration: 2020 Sept 14 → 2020 Sept 17
|Name||ICTIR 2020 - Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval|
|Conference||6th ACM SIGIR / 10th International Conference on the Theory of Information Retrieval, ICTIR 2020|
|Period||2020/09/14 → 2020/09/17|
Bibliographical note
Funding Information:
This work is partly supported by Artificial Intelligence Graduate School Program (2020-0-01361) and ITRC support program (IITP-2020-2020-0-01789) supervised by IITP.
© 2020 ACM.
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- Information Systems