The Sim4IR Workshop

At SIGIR 2021

Speakers and Papers at Sim4IR

Sim4IR features a diverse range of speakers, covering different topics related to simulation in the Information Retrieval domain. We have three kinds of talk: keynotes, paper presentations, and encore talks.

To view a detailed schedule, please check out the schedule page.

Keynotes

Dave Hawking How useful are results from simulated test collections?
David Hawking, ANU/Microsoft, Canberra, Australia
Session A
Download Slides
Reporting joint work with Bodo Billerbeck, Nick Craswell and Paul Thomas, all of Microsoft.
Cloud companies which host document data on behalf of clients wish to provide effective and responsive indexing and retrieval services for their clients, but staff from the cloud company are usually prevented from looking at client data, queries, and click logs. Microsoft examples include Exchange 365 and the Azure Cloud. The open source SynthaCorpus system provides techniques for statistically emulating a corpus and a query log, as well as for generating corpus-appropriate known-item test sets. But how well do results from a simulated collection predict the real performance for the client? We put SynthaCorpus methods to the test using three open source search systems (Indri, Terrier and ATIRE, each using a different ranking model) and four TREC datasets (AP, FR, Patents, and WT10g). We also investigate the trade-off between privacy and predictive accuracy by including two forms of emulation by encryption. Finally, to investigate the influence of “noise” we report predictions for the /bin/cp emulation method. My opinion is that the findings show that there are simulation methods which are capable of making predictions accurate enough for practical use while adequately preserving privacy. However, in reality that decision would be in the hands of clients. We found that there is indeed a trade-off between privacy and prediction accuracy, and that there are interesting interactions between retrieval system and dataset.
In the 1990s, while at ANU, David Hawking was a co-ordinator of the TREC Very Large Collection and Web Tracks, and co-creator of a number of well-known IR test collections. He was an IR researcher at CSIRO for ten years and his research was spun off into the Funnelback enterprise and web search company, where he worked as Chief Scientist. In 2013 he joined Microsoft, working on the Bing search engine until his retirement in 2018. Since retiring he has retained an honorary position at ANU.
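As a rough illustration of the prediction question raised in this abstract (and not part of SynthaCorpus itself), one natural check is whether the emulated collection ranks competing systems in the same order as the real one. The sketch below compares hypothetical effectiveness scores with a rank correlation; the system names and numbers are made up.

```python
# Illustrative sketch (not part of SynthaCorpus): check whether an emulated
# collection ranks competing systems in the same order as the real one.
from scipy.stats import kendalltau

# Hypothetical effectiveness scores (e.g., MAP) for each system, measured on
# the real collection and on its statistically emulated counterpart.
real_scores = {"indri": 0.231, "terrier": 0.254, "atire": 0.219}
emulated_scores = {"indri": 0.198, "terrier": 0.226, "atire": 0.185}

systems = sorted(real_scores)
tau, p_value = kendalltau([real_scores[s] for s in systems],
                          [emulated_scores[s] for s in systems])

# A high tau means the emulated collection predicts the relative ordering of
# systems well, even if the absolute scores differ.
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
```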
ChengXiang Zhai User Simulation for Information Retrieval Evaluation: Opportunities and Challenges
ChengXiang Zhai, UIUC, Illinois, US
Session B
Download Slides
As the utility of an Information Retrieval (IR) system depends on the users who use the system, evaluation of an IR system logically always requires specification of the assumed users (explicitly or implicitly). To make IR experiments reproducible, the specification of a user further needs to be in the form of an executable user simulator, i.e., a software agent that can simulate a user’s behavior when interacting with an IR system; with such user simulators, evaluation of any IR system can then be done by having the system interact with many user simulators and measuring how successful a simulated user is in finishing a search task and how much effort the simulated user has to make. In this talk, I will provide a broad discussion of the opportunities and challenges in evaluating IR systems using user simulation and briefly present some of our recent work on tackling those challenges, including a general formal framework for evaluating IR systems using search simulation, some specific models and algorithms for user simulation, some experience with evaluating interactive IR systems with user simulation, and a tester-based evaluation methodology for evaluating and comparing user simulators. I will conclude with some promising future directions in this area.
ChengXiang Zhai is a Donald Biggar Willett Professor in Engineering in the Department of Computer Science at the University of Illinois at Urbana-Champaign (UIUC), where he also holds a joint appointment at the Carl R. Woese Institute for Genomic Biology, the Department of Statistics, and the School of Information Sciences. His research interests include intelligent information retrieval, text mining, natural language processing, machine learning, and their applications. He has published over 300 papers in these areas. He has served as an Associate Editor for major journals in multiple areas, including information retrieval (ACM TOIS, IPM), data mining (ACM TKDD), and medical informatics (BMC MIDM), as Program Co-Chair of NAACL HLT’07, SIGIR’09, and WWW’15, and as Conference Co-Chair of CIKM’16, WSDM’18, and IEEE BigData’20. He is an ACM Fellow and a member of the ACM SIGIR Academy. He has received numerous awards, including the ACM SIGIR Test of Time Paper Award (three times), the 2004 Presidential Early Career Award for Scientists and Engineers (PECASE), an Alfred P. Sloan Research Fellowship, an IBM Faculty Award, an HP Innovation Research Award, a Microsoft Beyond-Search Research Award, the UIUC Rose Award for Teaching Excellence, and the UIUC Campus Award for Excellence in Graduate Student Mentoring.
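To make the simulator-based evaluation loop concrete, here is a minimal sketch, assuming a hypothetical search function and a toy simulated user. It is not the formal framework from the talk; it only shows how task success and user effort could be measured by letting a system interact with a simulator.

```python
# Toy illustration of evaluating an IR system against a user simulator.
import random

class RandomClickSimulator:
    """Simulated user: scans results top-down, clicks with fixed probability,
    and stops once it has found enough relevant items or runs out of patience."""
    def __init__(self, click_prob=0.4, target_relevant=3, patience=20):
        self.click_prob = click_prob            # probability of clicking an examined result
        self.target_relevant = target_relevant  # relevant items needed to finish the task
        self.patience = patience                # max results examined before giving up

    def run_session(self, search, query, qrels):
        found, effort = 0, 0
        for doc_id in search(query):
            effort += 1
            if random.random() < self.click_prob and doc_id in qrels:
                found += 1
            if found >= self.target_relevant or effort >= self.patience:
                break
        return found >= self.target_relevant, effort  # task success, user effort


def toy_search(query):
    # Stand-in for a real IR system: returns a ranked list of doc ids.
    return [f"doc{i}" for i in range(100)]


sim = RandomClickSimulator()
success, effort = sim.run_session(toy_search, "user simulation",
                                  qrels={"doc1", "doc3", "doc7"})
print(success, effort)
```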

Paper Presentations

Original papers presented at Sim4IR are available as part of a CEUR-WS joint proceedings with the CSR 2021 Workshop @ SIGIR. This can be accessed at http://ceur-ws.org/Vol-2911/.

Assessing Query Suggestions for Search Session Simulation
Sebastian Günther and Matthias Hagen
Session A
Access Paper
Research on simulating search behavior has mainly dealt with result list interactions in recent years. We instead focus on the querying process and describe a pilot study to assess the applicability of search engine query suggestions to simulate search sessions (i.e., sequences of topically related queries). In automatic and manual assessments, we evaluate to what extent a session detection approach considers the simulated query sequences as “authentic” and how humans perceive the sessions’ quality in the sense of coherence, realism, and representativeness of the underlying topic. As for the actual suggestion-based simulation, we compare different approaches to select the next query in a sequence (always selecting the first suggestion, random sampling, or topic-informed selection) to the human TREC Session track sessions and a previously suggested simulation scheme. Our results show that while it is easy to create query logs that are authentic to both users and automated evaluation, keeping the sessions related to an underlying topic can be difficult when relying on given suggestions only.
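For illustration only, a simple version of the three next-query selection strategies compared in the paper could look like the sketch below; the `suggest` function stands in for a real query-suggestion API and is purely hypothetical.

```python
# Sketch: build a simulated session (sequence of queries) from query suggestions.
import random

def simulate_session(initial_query, topic_terms, suggest,
                     strategy="first", length=5, seed=0):
    rng = random.Random(seed)
    session, query = [], initial_query
    for _ in range(length):
        session.append(query)
        suggestions = suggest(query)
        if not suggestions:
            break
        if strategy == "first":        # always follow the top suggestion
            query = suggestions[0]
        elif strategy == "random":     # sample uniformly among the suggestions
            query = rng.choice(suggestions)
        else:                          # "topic-informed": prefer suggestions that
            # share the most terms with the topic description
            query = max(suggestions,
                        key=lambda s: len(set(s.split()) & set(topic_terms)))
    return session
```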
ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior
Yurou Zhao, Jiaxin Mao, Qingyao Ai
Session A
Access Paper
Download Slides
Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to properly evaluate and compare different ULTR approaches has not been systematically investigated, and there is no shared task or benchmark that is specifically developed for ULTR. In this paper, we propose the Unbiased Learning to Rank Evaluation (ULTRE) framework. The proposed framework utilizes multiple click models in generating simulated click logs and supports the evaluation of both offline, counterfactual and online, bandit-based ULTR models. Our experiments show that the ULTRE framework is effective in click simulation and in comparing different ULTR models.
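As a sketch of the kind of click simulation such a framework relies on, the snippet below generates clicks from a simple position-based model; the examination probabilities and parameters are illustrative and not those used by ULTRE.

```python
# Sketch of click simulation with a simple position-based model (PBM).
import random

def simulate_clicks(ranked_relevance, eta=1.0, seed=0):
    """ranked_relevance: list of relevance probabilities, one per rank."""
    rng = random.Random(seed)
    clicks = []
    for rank, rel_prob in enumerate(ranked_relevance, start=1):
        examine_prob = (1.0 / rank) ** eta       # position bias: lower ranks seen less
        clicked = rng.random() < examine_prob * rel_prob
        clicks.append(int(clicked))
    return clicks

print(simulate_clicks([0.9, 0.2, 0.7, 0.1, 0.4]))
```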
State of the Art of User Simulation approaches for conversational information retrieval
Pierre Erbacher, Laure Soulier, Ludovic Denoyer
Session B
Access Paper
Conversational Information Retrieval (CIR) is an emerging field of Information Retrieval (IR) at the intersection of interactive IR and dialogue systems for open domain information needs. In order to optimize these interactions and enhance the user experience, it is necessary to improve IR models by taking into account sequential heterogeneous user-system interactions. Reinforcement learning has emerged as a paradigm particularly suited to optimize sequential decision making in many domains and has recently appeared in IR. However, training these systems by reinforcement learning on users is not feasible. One solution is to train IR systems on user simulations that model the behavior of real users. Our contribution is twofold: 1) reviewing the literature on user modeling and user simulation for information access, and 2) discussing the different research perspectives for user simulations in the context of CIR.

Encore Talks

Sim4IR will also have eight encore talks. These short, ten-minute talks provide insight into recently published research considering simulation in the context of Information Retrieval.

Ben Carterette How Am I Doing?: Evaluating Conversational Search Systems Offline
Ben Carterette, Spotify, US
Session B
As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
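For reference, the geometric browsing model mentioned here is the one underlying RBP: the simulated user moves from one item to the next with fixed persistence p, so rank k is reached with probability p^(k-1). Below is a minimal sketch of standard RBP, not the paper's extended subtopic-based measure.

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision: geometric user model with persistence p."""
    return (1 - p) * sum(rel * p ** k for k, rel in enumerate(relevances))

# Example: five returned items, three judged relevant.
print(rbp([1, 0, 1, 1, 0], p=0.8))
```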
Claudia Hauff SIREN: a simulation framework for understanding the effects of recommender systems in online news environments
Claudia Hauff, TU Delft, The Netherlands
Session A
Download Slides
The growing volume of digital data stimulates the adoption of recommender systems in different socioeconomic domains, including news industries. While news recommenders help consumers deal with information overload and increase their engagement, their use also raises an increasing number of societal concerns, such as “Matthew effects”, “filter bubbles”, and the overall lack of transparency. We argue that focusing on transparency for content-providers is an under-explored avenue. As such, we designed a simulation framework called SIREN (SImulating Recommender Effects in online News environments), which allows content providers to (i) select and parameterize different recommenders and (ii) analyze and visualize their effects with respect to two diversity metrics. Taking the U.S. news media as a case study, we present an analysis on the recommender effects with respect to long-tail novelty and unexpectedness using SIREN. Our analysis offers a number of interesting findings, such as the similar potential of certain algorithmically simple (item-based k-Nearest Neighbour) and sophisticated strategies (based on Bayesian Personalized Ranking) to increase diversity over time. Overall, we argue that simulating the effects of recommender systems can help content providers to make more informed decisions when choosing algorithmic recommenders, and as such can help mitigate the aforementioned societal concerns.
Jin Huang Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems
Jin Huang, UvA, The Netherlands
Session B
Reinforcement learning for recommendation (RL4Rec) methods are increasingly receiving attention as an effective way to improve long-term user engagement. However, applying RL4Rec online comes with risks: exploration may lead to periods of detrimental user experience. Moreover, few researchers have access to real-world recommender systems. Simulations have been put forward as a solution where user feedback is simulated based on logged historical user data, thus enabling optimization and evaluation without being run online. While simulators do not risk the user experience and are widely accessible, we identify an important limitation of existing simulation methods. They ignore the interaction biases present in logged user data, and consequently, these biases affect the resulting simulation. As a solution to this issue, we introduce a debiasing step in the simulation pipeline, which corrects for the biases present in the logged data before it is used to simulate user behavior. To evaluate the effects of bias on RL4Rec simulations, we propose a novel evaluation approach for simulators that considers the performance of policies optimized with the simulator. Our results reveal that the biases from logged data negatively impact the resulting policies, unless corrected for with our debiasing method. While our debiasing methods can be applied to any simulator, we make our complete pipeline publicly available as the Simulator for OFfline leArning and evaluation (SOFA): the first simulator that accounts for interaction biases prior to optimization and evaluation.
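One standard way to implement such a debiasing step is inverse propensity scoring, where each logged interaction is reweighted by the inverse of its exposure probability. The sketch below illustrates that general idea and is not necessarily SOFA's exact procedure.

```python
# Illustration of a debiasing step via inverse propensity scoring (IPS):
# each logged interaction is weighted by the inverse of the probability that
# the user was exposed to the item, so over-exposed items count for less.
def ips_weighted_ratings(logged_data, propensities, clip=10.0):
    """logged_data: iterable of (user, item, rating);
    propensities: dict mapping item -> estimated exposure probability."""
    weighted = []
    for user, item, rating in logged_data:
        weight = min(1.0 / max(propensities[item], 1e-6), clip)  # clip to limit variance
        weighted.append((user, item, rating, weight))
    return weighted
```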
David Maxwell Modelling Search and Stopping in Interactive Information Retrieval
David Maxwell, TU Delft, The Netherlands
Session B
Searching for information when using a computerised retrieval system is a complex and inherently interactive process. Individuals during a search session may issue multiple queries, and examine a varying number of result summaries and documents per query. Searchers must also decide when to stop assessing content for relevance, or decide when to stop their search session altogether. Despite being such a fundamental activity, only a limited number of studies have explored stopping behaviours in detail, with a majority reporting that searchers stop because they decide that what they have found feels “good enough”. Notwithstanding the limited exploration of stopping during search, the phenomenon is central to the study of Information Retrieval, playing a role in the models and measures that we employ. However, the current de facto assumption is that searchers will examine k documents, i.e., that they examine results up to a fixed depth.
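The contrast drawn here can be made concrete with two toy examination strategies: the fixed-depth assumption versus a simple "give up after too many non-relevant results" stopping rule. Both are purely illustrative, not the thesis' actual stopping models.

```python
# Fixed-depth examination: the de facto assumption of examining exactly k results.
def examine_fixed_depth(results, k=10):
    return results[:k]

# Frustration-style stopping rule: stop after a run of non-relevant results.
def examine_until_frustrated(results, is_relevant, max_nonrelevant=3):
    examined, nonrelevant_run = [], 0
    for doc in results:
        examined.append(doc)
        nonrelevant_run = 0 if is_relevant(doc) else nonrelevant_run + 1
        if nonrelevant_run >= max_nonrelevant:   # the "good enough"/give-up point
            break
    return examined
```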
Alexandre Salle Studying the Effectiveness of Conversational Search Refinement Through User Simulation
Alexandre Salle, UFRGS, Brazil
Session B
A key application of conversational search is refining a user’s search intent by asking a series of clarification questions, aiming to improve the relevance of search results. Training and evaluating such conversational systems currently requires human participation, making it unfeasible to examine a wide range of user behaviors. To support robust training/evaluation of such systems, we propose a simulation framework called CoSearcher that includes a parameterized user simulator controlling key behavioral factors like cooperativeness and patience. Using a standard conversational query clarification benchmark, we experiment with a range of user behaviors, semantic policies, and dynamic facet generation. Our results quantify the effects of user behaviors, and identify critical conditions required for conversational search refinement to be effective.
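A rough sketch of what a user simulator parameterized by cooperativeness and patience could look like is shown below; the clarification interface and answer logic are hypothetical, not CoSearcher's implementation.

```python
# Toy parameterized simulated user for conversational query clarification.
import random

class SimulatedUser:
    def __init__(self, intent_facets, cooperativeness=0.8, patience=5, seed=0):
        self.intent_facets = intent_facets      # hidden facets of the search intent
        self.cooperativeness = cooperativeness  # probability of giving a useful answer
        self.patience = patience                # max clarification turns tolerated
        self.rng = random.Random(seed)

    def answer(self, clarifying_question, turn):
        if turn >= self.patience:
            return None                         # user abandons the session
        if self.rng.random() < self.cooperativeness:
            # cooperative: confirm or deny based on the hidden intent
            hit = any(facet in clarifying_question for facet in self.intent_facets)
            return "yes" if hit else "no"
        return "I don't know"                   # uncooperative / uninformative reply
```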
Manel Slokom Towards User-Oriented Privacy for Recommender System Data: A Personalization-based Approach to Gender Obfuscation for User Profiles
Manel Slokom, TU Delft, The Netherlands
Session A
Download Slides
This paper proposes a new privacy solution for the data used to train a recommender system, i.e., the user-item matrix. The solution, called Personalized Blurring (PerBlur), is a simple, yet effective, approach to adding and removing items from users’ profiles in order to generate an obfuscated user-item matrix. The novelty of PerBlur is to personalize the choice of items used for obfuscation to the individual user profiles. PerBlur is formulated within a user-oriented paradigm of recommender system data privacy that aims at making privacy solution understandable, unobtrusive, and useful for the user. When obfuscated data is used for training, a recommender system algorithm is able to reach performance comparable to what is attained when it is trained on the original, unobfuscated data. At the same time, a classifier can no longer reliably use the obfuscated data to predict the gender of users, indicating that implicit gender information has been removed. In addition to introducing PerBlur, we make several key contributions. First, we propose an evaluation protocol that creates a fair environment to compare between different obfuscation conditions. Second, we carry out experiments that show that gender obfuscation impacts the fairness and diversity of recommender system results. We show that PerBlur maintains fairness by not causing a gender-specific drop in recommender system performance. We also demonstrate the ability of PerBlur, through its greedy removal, to recommend a smaller proportion of gender-stereotypical items, i.e., items that are highly specific to a particular gender. In sum, our work establishes that a simple, transparent approach to gender obfuscation can protect user privacy while at the same time improving recommendation results for users by maintaining fairness and enhancing diversity.
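To give a feel for profile obfuscation of this kind, here is a toy sketch that removes a user's most gender-indicative items and adds personalized, less indicative ones; the scoring and selection are simplified illustrations, not PerBlur's actual algorithm.

```python
def obfuscate_profile(profile, indicativeness, candidates, k_remove=2, k_add=2):
    """profile: set of item ids in the user's profile;
    indicativeness: item -> how strongly the item signals the user's gender;
    candidates: personalized items not yet in the profile (e.g., from the
    user's own recommendations)."""
    # greedily drop the items that most strongly reveal gender
    to_remove = sorted(profile, key=lambda i: indicativeness.get(i, 0.0),
                       reverse=True)[:k_remove]
    # add personalized items that reveal gender the least
    to_add = sorted(candidates, key=lambda i: indicativeness.get(i, 0.0))[:k_add]
    return (set(profile) - set(to_remove)) | set(to_add)
```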
Junqi Zhang Context-Aware Ranking by Constructing a Virtual Environment for Reinforcement Learning
Junqi Zhang, Tsinghua University, China
Session A
Result ranking is one of the major concerns for Web search technologies. Most existing methodologies rank search results in descending order according to pointwise relevance estimation of single results. However, the dependency relationships between different search results are not taken into account. As search engine result pages contain more and more heterogeneous components, a better ranking strategy should be a context-aware process and optimize result ranking globally. In this paper, we propose a novel framework which aims to improve context-aware listwise ranking performance by optimizing online evaluation metrics. The ranking problem is formalized as a Markov Decision Process (MDP) and solved with the reinforcement learning paradigm. To avoid the high cost to online systems during the training of the ranking model, we construct a virtual environment with millions of historical click logs to simulate the behavior of real users. Extensive experiments on both simulated and real datasets show that: 1) constructing a virtual environment can effectively leverage large-scale click logs and capture some important properties of real users; and 2) the proposed framework can improve search ranking performance by a large margin.
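As an illustration of the MDP formulation described above, one episode of listwise ranking against a click-log-based virtual environment might be sketched as follows; the policy and environment here are stand-ins, not the paper's implementation.

```python
# One episode of MDP-style listwise ranking: the state is the set of remaining
# candidates plus the partial ranking, the action is picking the next result,
# and the reward comes from a virtual environment built from click logs.
def run_episode(candidates, policy, virtual_env):
    state, ranking = list(candidates), []
    total_reward = 0.0
    while state:
        doc = policy(state, ranking)    # action: choose the next result to place
        state.remove(doc)
        ranking.append(doc)
        # reward: simulated user feedback (e.g., expected clicks) for placing
        # `doc` at this rank, given the results already shown above it
        total_reward += virtual_env(doc, rank=len(ranking), context=ranking)
    return ranking, total_reward
```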
Shuo Zhang Evaluating Conversational Recommender Systems via User Simulation
Shuo Zhang, Bloomberg, London
Session B
Download Slides
Conversational information access is an emerging research area. Currently, human evaluation is used for end-to-end system evaluation, which is both very time and resource intensive at scale, and thus becomes a bottleneck of progress. As an alternative, we propose automated evaluation by means of simulating users. Our user simulator aims to generate responses that a real human would give by considering both individual preferences and the general flow of interaction with the system. We evaluate our simulation approach on an item recommendation task by comparing three existing conversational recommender systems. We show that preference modeling and task-specific interaction models both contribute to more realistic simulations, and can help achieve high correlation between automatic evaluation measures and manual human assessments.