Learning to rank through user preference data
Analysis of data aggregated preferences. The ranking of the elements as separate and static elements. Evaluation of the algorithms, assumptions, weight and shifting implicit preferences. The essence of ranking elements as a function of their attributes.
Рубрика | Программирование, компьютеры и кибернетика |
Вид | дипломная работа |
Язык | английский |
Дата добавления | 30.08.2016 |
Размер файла | 147,5 K |
Отправить свою хорошую работу в базу знаний просто. Используйте форму, расположенную ниже
Студенты, аспиранты, молодые ученые, использующие базу знаний в своей учебе и работе, будут вам очень благодарны.
Размещено на http://www.allbest.ru/
NATIONAL RESEARCH UNIVERSITY
HIGHER SCHOOL of ECONOMICS
Faculty of Business and Management
School of Business Informatics
MASTER THESIS
Learning to rank through user preference data
Author:
David Cortes
Supervisor
Attila Kertesz-Farkas
Moscow 2016
Abstract
This work examined approaches for ranking items in electronic catalogs under the intuitive idea of respecting user preferences. Since it is difficult to ask users to state their preferences, these were deduced from click data instead, in the form of “Item A is preferred to Item B”, following ideas from search engines. For example, a clicked item in a ranked list was assumed to be preferred to unclicked items ranked above it. Preferences deduced from these methods have been shown to agree with explicitly-stated preferences, but are biased and not always accurate. It was proposed to treat items either as elements on their own or as defined by their attributes. For the first case, algorithms based on iteratively improving an order by swapping one pair at a time were proposed, among these “Metropolis-Hastings swapping”, and for the second case, an algorithm similar to a one-class SVM that devises utility functions to rank with them was proposed. These were found to generate rankings that respect the preferences to a greater extent than sorting by clicks. ranking algorithm aggregated preference
1. Introduction
Traditionally, retail companies have tried different ideas for placing items in a physical store in a manner that leads to more purchases or increased user satisfaction, and many online retailers have tried ideas to do the same in their electronic “stores”, resulting in developments such as recommender systems and advanced search systems to help search in very wide catalogs such as Amazon, but the problem of ordering the items in the catalogs of more common or ordinary businesses such as clothing stores has not received much attention.
When people browse the online catalogs of a store, they typically do not inspect all the items in the list. Many visitors form their first impression of a retailer based on the items that they see on their first visit. Visitors may abandon a site if nothing of their interest is found among the first items shown to them, and along with them potential purchases are missed. It is reasonable to think that there is something to gain by displaying the items of any given section under some order where the items more likely to be interesting to users appear at the top.
Some of the most common ways of arranging items in online catalogs are to sort them by clicks, purchases, date of introduction, alphabetical order, or price, yet - apart from sorting by clicks or sales - these methods do not seem to favor more popular or more attractive items over less attractive or popular ones. Sorting by clicks or by purchases seems to achieve the goal of putting more popular items first, but intuitively, one would think that items that are displayed first are more likely to be seen and clicked/purchased than items displayed last. This might create a loop where initially high-ranked items are able to receive more clicks/purchases and are deemed even more attractive, while initially lower-ranked items do not get the chance to be clicked/purchased, maintaining the initial ordering regardless of items' attractiveness. Research in the context of search engines' results has shown that this is indeed the case (Craswell et al., 2008). Thus, there is some logic in thinking that the rank (ordinal position in the list) at which an item is clicked is also a useful indicator, but this is not accounted for when sorting items by raw number of clicks/purchases. Sorting by clicks/purchases also logically favors older items over newer items, since the longer an item is displayed in a catalog, the more opportunities it has for receiving clicks or being purchased, but newer items are not necessarily less attractive than older ones. It could be helpful to have some criteria for telling how attractive could a new item be compared to older items - for example, it would be reasonable to think that a newly introduced item which is a copy of an older one with some small modification (such as the same jacket in a different color) should fare similarly to the older one. Moreover, if there are thousands of items, then we cannot reasonably expect all of them to be seen by users, regardless of how good they might be, and it would be helpful to know which of the items that are not typically seen due to their unfavorable position in the list are actually attractive. So, can we do better than sorting by clicks?
The aim of this work is to propose algorithms that could construct an ordering that puts more attractive items first in the default order of an electronic catalog (for example, the “shoes” section of a retailer), bypassing the shortfalls of a ranking by number of clicks.
Attractiveness or popularity, however, are not exact measures and it is hard to assess how popular or attractive a given item is in a given catalog compared to other items. Gathering explicit feedback from customers (e.g. 10-star ratings of movies or ordered preference lists) is generally difficult, expensive and might be burdensome to some users, but so called implicit feedback taken from logs of user activity - that is, customer behavior data which is collected without asking anything to the user, such as hours spent watching a TV series - has been successfully used to make product recommendations (Hu, Koren and Volinsky, 2008) and to rank results in search engines (Joachims, 2002).
Implicit user data has the advantage of being easy to collect in large quantities without associated costs other than data storage, unlike explicit data which usually requires questionnaires, stratified sampling, etc., and without posing any hassles to users. Research has found implicit feedback to be highly correlated with preference in different contexts (Oard and Kim, 2001, Konstan et al., 1997, Joachims et al., 2007, Parra and Amatrian, 2011).
Implicit feedback can be interpreted or processed in different ways. One paradigm that has shown to be useful in some contexts is to interpret certain observed behaviors as stemming from user preferences of the form “Item A is preferred to Item B” (Holland, Ester and KieЯling, 2003, Joachims, 2002), and the same approach of taking pairwise preferences has also been shown to work for product recommendations when using explicit user feedback (Park and Chu, 2009, Salimans, Paquet, and Graepel, 2012). This format of user feedback seems to be especially well suited for the problem of ranking items.
Since not all catalogs have the same characteristics, different strategies might be needed according to the catalog size, the rate of introduction of new items, the presence of item seasons (e.g. winter clothes), among others. Different ideas will be proposed to order the items of the catalogs in different circumstances based on sets of pairwise preferences, whether they are collected explicitly or deduced from customer behavior.
2. Preference Mining
It has long been proposed to gather data from user behavior on web sites to make product recommendations. Typically, data from which items are clicked or which are interacted with (as in played in a music service, watched in a video service, etc.) are used in the context of product recommendations, but in the case of e-commerce it is possible to gather more information. Holland, Ester, and KieЯling (2003) proposed to gather implicit feedback taking also into consideration searches for product names, actions such as adding to cart, purchasing, etc. and developed a framework for extracting preferences of the form “Item A is preferred to Item B” from application logs, based on the number of interactions with different items. However, it is not clear how preferences deduced from this framework relate to actual, explicitly-stated user preferences.
In the context of recommender systems, most of the literature has used these data by recording how much time is spent with an item or how many clicks and purchases it has from each user (for example in Hu, Koren, and Volinsky (2008), Nag, 2008, Rendle et al. (2009), among others). Hu, Koren, and Volinsky (2008) note that this type of data should be interpreted differently from explicit user evaluations (such as 10-star movie ratings), differing mostly in that it cannot be interpreted as showing negative preferences and that stronger feedback (e.g. the more product views or clicks) does not necessarily mean stronger preference, but just stronger confidence in the existence of a preference. However, Parra and Amatriain (2011) dispute the first idea and suggest than in contexts where repeated interactions are frequent, low feedback (as in watching only the first 10 minutes of the first episode of a TV series) can be interpreted as negative preference and they built a model to predict explicit evaluation based on implicit feedback and even went as far as using those predicted preferences in collaborative filtering algorithms as if they were regular explicit feedback (Parra et al., 2011), but they did not test the quality of those recommendations in an online setting.
However, it is somewhat logical to think that in some settings - such as when browsing a catalog - the number of interactions with an item would also heavily depend on how easy it is to get to it or how often it is displayed to users, with results presented earlier or more recurrently having a bias due to their easier accessibility. In this line, Jung, Hong and Kim (2005), for example, presented a recommender system model that corrects for this “availability” bias in TV programs recommendations, and suggested that we can only deduce that there is a preference when we see a higher-than-expected frequency of interaction with an item given its availability bias.
In the context of search engines, the interpretation of implicit feedback has been very different and its relationship with explicit judgments has been more thoroughly studied. One notable logical observation is that if a user scans a ranked list of links and clicks only some of them, then he made a somewhat informed decision not to click the links that were examined before a clicked link, and so it is probably the case that the user prefers the clicked link to the unclicked links that were ranked above, and these preferences could be deduced from click data alone (Joachims, 2002). For example, if a user was shown an ordered list of 10 search results and he clicked links #2 and #5, then that would imply the preferences:
(Following the literature from economics and mathematics, this work will use the symbol ”” for indicating preference, thus means that element A is preferred to element B).
However, we cannot be sure that the user saw all the results in the intended order, or that the clicked link is really preferred to the other(s) without the user having seen in greater details these links (vs. only checking the short description in a search engine) and compared them, so these are noisy preferences. Additionally, the pairwise preferences that can be observed following this methodology depend on the order in which items are displayed, so they are also biased. Nevertheless, a simple algorithm re-ranking web search results based on this kind of generated preference data was tried and it was reported to provide better user satisfaction with query results (in a time when search engines did not apply these methods themselves).
This study was then followed by another study (Joachims et al., 2007) that evaluated the validity of such feedback in Google searches, with some pre-defined search tasks involving navigational and informational queries, aided with the use of eye tracking systems and relevance judgments of the links' abstracts and contents by different people, experimenting with swapping the first and second search results and with presenting the first page results in reversed order. This study evaluated preference-generating rules coming from different hypotheses about user clicking behavior and compared to which extent did the preferences generated from these different hypotheses agree with the relevance judgments from other people - which, notably, had a pairwise inter-rater agreement rate of 89%. Among other things, the study found out that:
· Users tend to inspect the abstracts (link descriptions) of results #1 and #2 in most of their searches, but they tend to click the first one more often, regardless of its relevance, both in the original order from Google and in the altered orderings. Thus, there is probably a trust bias for the quality of Google's results.
· Users mostly scan the search results linearly, and the probability of viewing a link decreases monotonically with rank. Click frequency, on the other hand, also tends to decrease with rank, but not monotonically.
· For ranks 7 to 10, users tend to pay almost the same amount of attention to all the links rather than decreasingly as with the first ones. This 7th rank corresponded to the rank at which the user needed to scroll down in the screen that was used.
· The abstract right below a clicked link is viewed around 50% of the times.
· Reversing the order of the results significantly increased the average rank of the clicks and had more users scrolling further down the list. Also, the same top ranked results received fewer clicks in this setting.
In terms of generating pairwise preferences with click data and comparing them with explicit judgments from other people, they found out that:
· The most consistent (with the external judgments) method of generating preferences was to assume that the last clicked link is preferred to the unclicked links presented above it.
· Other highly consistent methods were assuming any clicked link - not just the last one - to be preferred to the unclicked links ranked above it (overall agreement of 83.1±4.4% for all clicks vs. 83.8±4.6% for only last one) and assuming a clicked link to be preferred only to the unclicked link ranked immediately above it (overall agreement of 81.6±9.5%).
· It was not as consistent to assume a clicked link to be preferred to the unclicked link immediately below (70.4±8%), and even less consistent to assume the last clicked link to be preferred to the clicked links above it (46.9±13.9% - and especially lower when altering the rankings).
· The consistency of the preferences from all the different mechanisms turned out to be more variable with the altered results than with the original ones.
Other studies have also tried to examine the behavioral patterns that can be deduced from implicit feedback in search engines. Craswell et al. (2008) did some experiments in which they swapped search results in adjacent rank positions and and collected clicks to examine different explanatory hypotheses about the rank bias seen when clicking search results (i.e. the tendency to see more clicks in top-ranked results regardless of their perceived relevance). Among the models they tried, the best one (in the sense of better fitting the data) turned out to be the “cascade model”, consisting in that the user scans the results linearly and clicks on the first relevant result that is found. This model is consistent with the data-gathering process from Joachims (2002) and turned out to be very explanatory of the bias present in the first 4 positions, but turned out to perform poorly for lower-ranked results. The authors hypothesize that this might be due to the fact that some users simply leave if no relevant result is found within the top results, or because users tend to examine more results simultaneously when they browse further down the list. This second hypothesis also seems to be consistent with the findings from Joachims et al. (2007). From the perspective of marketing research, it seems to be the case that people tend to make product comparisons as relative to each other rather than as absolutes, and that the results of the comparisons vary depending on the object of reference to which another object is compared (Hardie, Johnson and Fader, 1993).
Agichtein et al. (2006) also ran an experiment in which they generated pairwise preferences from clicks following some of the rules from Joachims et al. (2007) and compared them to human judgments for some fixed queries, in a random sample of search engine traffic rather than in a laboratory setting, and found that preferences generated from assuming a clicked link to be preferred to both the unclicked links above (this alone was used in Joachims (2002)) and the unclicked link below had a slightly higher agreement rate with the explicit judgments than the preferences generated from assuming a clicked link to be preferred to only the unclicked links above. This study also evaluated other means of evaluating preference taking into consideration the rank bias like in Jung, Hong and Kim (2005), and found preferences generated under these methods to be more consistent with external judgments. Dou et al. (2008) did a similar experiment in which they evaluated how would rankings taken from preferences generated as when A has received more clicks than B agree with human explicit judgments, but found the agreement to be low. Unfortunately, they did not evaluate the pairwise agreement rate. Thus, we can conclude that the rank at which the clicks happen is also an important indicator.
In the area of e-commerce, some of the ideas from these findings would likely also hold true and there is probably something to gain by making use of data on users' clicking behavior to mine pairwise preferences between items. This work will follow the search engine approach of deducing preferences for coming up with better orderings, with data provided by an online retailer of clothes that was obtained by recording the order in which items were shown, considering only shown items (thus if, for example, a user skips a product page, then these items are skipped too), and which ones were clicked. Even though in most electronic catalogs the items are displayed in a grid rather than in a vertical list, some assumptions probably still hold, maybe to a lesser degree and even less for items placed in the corners, as there might be a higher proportion of people who do not scan the results sequentially, but validating these assumptions for the case of e-commerce is outside the scope of this work. Unfortunately, due to the way in which most web pages are rendered, it is not possible to know exactly at which row and column did a product appear in a user's browser window.
For the purpose of mining preferences, it must be taken into consideration that product catalogs have some notable differences with search engines: most catalog sections have a larger number of items than search result pages, usually have no “abstract” or short description, but rather a (hopefully) descriptive image, title and price, and users usually inspect more items than in search engine results. Also, since there are different tastes among consumers and items are usually not presented in some optimal ordering (which is the goal of this work), it is not usual for users to blindly click the first item presented, and the average rank of clicked elements tends to be a lot higher than in search engines. In order to illustrate it, here is the click distribution for the first 100 ranks of three different sections of the webpage, counting the number of times that each position in the list received a clicked during a certain time period.
As can be seen, users browsing these catalogs seem to go a lot further down the list than in search engines and do not tend to automatically click anything from the top ranks.
Thus, in the case of product catalogs it seems to be a lot harder to measure the click bias as in Agichtein et al. (2006) or “availability bias” as in Jung, Hong and Kim (2005). As well, there are no “queries” (unless searching a product by name), but rather different sections or categories, and users do not redefine queries, but rather put product filters or order by price or date. And since the sections are predefined and items usually are presented in roughly the same order or some variations (such as by price and by date when the user chooses to), it should be noted that preferences mined like in Joachims (2002) would never be able to confirm the goodness of the current ranking function - for example, if there were only one catalog, and users always chose to sort in it in ascending order by price and click some items, then this rule would not be able to generate preferences of a lower-priced product being preferred to a higher-priced product, even though the users chose to see the cheaper items first, and would only suggest opposite rankings.
In order to combat this, it would be a good idea to follow the approach of Agichtein et al. (2006) and also consider clicked items to be preferred to the unclicked item below. As users tend to inspect a larger number of items in product catalogs than in search engines, this could be taken a bit further such as the clicked item being preferred to the immediately next two items when these are unclicked. In the same line, since there are more elements to evaluate, probably the memory of the products being compared fades and the certainness of there being a preference diminishes for products that are further apart in the rank - so for example, if a user only clicks element #10, then we can probably be more sure about item #10 being preferred to item #9 than about item #10 being preferred to item #1. And since clicks at lower ranks are less common, these clicks might be more valuable than earlier clicks. One way to account for these ideas would be to assign weights to the preferences, with more certain preferences being weighted higher, and clicks that go further also being weighted higher - even though these two effects would contradict each other, the first one should decrease the weights faster for further-apart elements than the second one increases them. The specific methods of assigning weights will be inspected later. These hypotheses are, of course, just guesses and have not been validated yet.
One big problem that remains is that the appearance of these preferences, even if they are always correct, is biased and the probability of observing certain preferences will always be higher than the probability of observing others, and not all pairwise comparisons will be available. Particularly, if items are always shown in the same order and preferences are collected like this, they will always support the reversed ordering of items to be the preferred one. The only way to fix this would be to display items in random order, or to randomize parts of the list at least some times, but that would go against the purpose of achieving better orderings. Perhaps some epsilon-greedy style algorithm could be used to balance exploration-exploitation and mine rarer preferences, but such an approach will not be explored here. Radlinski and Joachims (2006) explore this problem and propose an algorithm called FairPairs which can make the data converge to the true preferences with a “minimally invasive randomization” for the case of search engines, consisting of swapping pairs of adjacent links with a 50% probability when they are displayed to the user. However, unlike in search engines, product catalogs have other features such as the ability to add filters and sort by different criteria, so trying the exact same algorithm with 50% probability might not make the collected preferences converge.
Now, in this work, the analyses that follow will only use click data that is provided by a company, but the data to be collected for deducing preferences doesn't need to stop at clicks and ranks. Other additional data could also be collected for this purpose. For example, Fox et al. (2005) report that other implicit measures can also be predictive of user satisfaction with the content of a given link in search results, such as time spent within a given link and exit type (e.g. going back to the results list, closing the browser window, etc.), and that measures such as the amount of scrolling (searching further down in the list for results) and query reformulations can be predictive of overall satisfaction with a search session.
And the click-collection could also be taken further as has been proposed by Ackerman and Chen (2011), as considering items in later stages of the purchase process to be preferred to the ones in earlier stages - for example, clicked items are preferred to unclicked items, and among those, items added to the shopping cart are preferred to both unclicked items and items not added to the cart, and items purchased are in turned preferred to all the others, beginning with the unclicked items.
Other works in the area of recommender systems for e-commerce have also found the recentness of items - that is, time elapsed since their introduction - to be predictive of overall liking, with newer items being more attractive than older ones. For example, Lee, Park and Park (2008) built a collaborative filtering-based recommender system in which they built pseudo-ratings on the base of product purchases using information on the product launch time and users' purchase time, with more recent items being purchased early on being assigned higher “ratings”, and this system was reported to provide better recommendations than when using only purchase data without the time aspect. In the context of music service recommendations, Parra and Amatriain (2011) also report the recentness of a user's interaction with an item to be predictive of explicit rating of that item.
3. Judging according to aggregated preferences
Having deduced these preferences from different users, these should be able to tell whether a method for ordering items (from here on, a ranking function) or a given ordering of the items is good or at least better than another one. This section will examine criteria to look at.
In search engines and in information retrieval in general, metrics for judging an ordered list usually assume the existence of a relevance score assigned to each element, with better orderings being those that rank more relevant elements higher up the list. One of the typical metrics used is nDCG (normalized discounted cumulative gain), which takes the DCG up to a certain rank position , usually defined as:
and then divides it by the maximum possible DCG. This was the evaluation metric used in Yandex's Personalized Web Search Challenge in 2013-2014 , for example. Unfortunately, such kind of metrics cannot be applied in the case of product catalogs because there are no relevance indicators.
But metrics related to pairwise preferences have also been used, by taking into account the number of inversions that would have been observed in the data if results were displayed under the ranking function being evaluated, with fewer being better - so for example, a ranking function would be penalized if data suggested the preference but it ranked B above A. This was the approach taken in Joachims (2002) and in Agichtein et al. (2006) (although in this second study, the pairwise preferences were taken from explicit judgments rather than from the implicit data), and metrics related to pairwise inversions have also been used in combining search engine results (for example in Freund et al. (2003)).
Problems of ranking or ordering elements have arisen in other domains too, such as in social choice theory, where the input data would be ordered lists of candidates or desired outcomes by different voters, or in recommender systems, where the input data would be ordered lists of a subset of the available elements, among others. Usual criteria for judging a ranking function or an ordered list make use of the input ordered lists, which in this case we do not have either. However, some of the methods intended to come up with an optimal ordering or #1 element simply decompose these input lists into pairwise preferences (such as in Young and Levenglick (1978) and in Schulze (2011)), and output orderings are sometimes also evaluated according to pairwise agreements with the input lists, as is intended here.
But judging with pairwise preferences is not straightforward. One notable observation is that, if aggregating preferences from multiple people by a majority rule (i.e. if there are more cases of than of, then it is decided that), even if these preferences follow the axioms of rational preferences, the resulting aggregation can have cycles of intransitive preferences, where the majority agree that and but then also agree that (known as Condorcet's paradox).
In this regard, social choice theory has defined some desirable properties for ranking functions. One is that, if there is a so-called Condorcet winner, then it should be ranked first (or alternatively, elected). A Condorcet winner is an element (a candidate in voting systems) that, when compared to every other element, is preferred to all of them, but such an element does not always exist. Another desirable property is independence of irrelevant alternatives, which means that the decision of ranking one item over another (by the ranking function) shouldn't depend on the presence or absence of other elements. Unfortunately, the only way that a ranking function that takes arbitrary individual preferences could satisfy these two properties while ranking all of the elements is by being dictatorial, meaning that the output ordered list corresponds to one of the input ordered lists (i.e. the outcome is always determined only by the preferences of one person, thus dictatorial, and the only possible variation is choosing a different dictator), known as Arrow's impossibility theorem. The proof of this theorem only analyzes the positions of items in the output ranking relative to each other under the hypothetical ranking function that satisfies all these properties simultaneously, and is thus applicable to any type of ranking algorithm (Geanakoplos, 2005).
As a result, voting and ranking systems have to decide which of these desirable properties to forego. In voting systems, independence of irrelevant alternatives is usually deemed the least important. In the case of e-commerce, it is intuitive that this property might not be so important either, especially taking into consideration the findings from marketing research from Hardie et al. (1993), which suggests that consumers choosing the better between two brands do not seem to make this decision independent of third brands either. Being non-dictatorial seems important in voting systems, but despite its ugly name, a dictatorial ranking function does not need to be a bad one if a majority of customers share the same preferences - it might even be a desirable property - and in aggregating ranked preferences in other contexts it has been reported that outputting one of the inputs at random might be a good ranking function (Ailon, Charikar and Newman, 2008), but since here we only have pairwise preferences, it is impossible to do that. Being a Condorcet method (that is, always ranking the Condorcet winner #1) is definitely a desirable property and is usually deemed to be the most important one in voting systems, but in an electronic catalog, the effect of misplacing the #1 product might not be as disastrous as in voting systems and this property might not be essential. Additionally, since the pairwise preferences here would be incomplete (i.e. not between all pairs of items), it might not even be possible to identify a Condorcet winner. Unrestricted domain (being able to take any arbitrary rational preferences between the elements as input) is essential though, since this is the kind of data to be used here.
As a conclusion, good criteria for judging a ranking function for electronic catalogs could be that the ordering outputted by it should satisfy the largest possible number of preferences while violating the least possible, and it is not so important for it to meet the criteria under which ranking functions from other domains are typically judged. And since, unlike voting systems, ranking functions here do not need to be transparent or to always output the same result when taking the same input, randomized algorithms could also be used.
Typically, in social choice theory the elements to be ranked are conceived of as whole, indivisible elements, independent of each other, so that the preference between two elements does not relate to the preference between two other different elements - that is, candidate A is not defined by its political party or other attributes, and does not tell anything about . In contrast, products sold online usually have some shared characteristics, and if a person always prefers products of one brand over another, this fact might be used to deduce an unknown preference between two items. As such, it might be better to conceive of the preferences not as being between two entities, but as being between two different sets of attributes, where the attributes are characteristics such as product brand, color, etc. Methods that represent the items under these two views will be explored in the following sections.
4. Ranking items as independent and static elements
From the previous section, if we consider items to be indivisible elements independent from each other, and the items in a given catalog to be static and the catalog not so large (so that we could expect all of them to receive clicks), then it would be reasonable to think that the optimal ordering is the one that ranks all the possible items in a way that maximizes the number of satisfied preferences minus the number of violated preferences, and there might be more than one optimal ordering. This would be the case if we have a small catalog of items that are quite different to each other and new items are not, or rarely, introduced.
For these purposes, since there are likely to be contradictory preferences, we could aggregate them by calculating what is the net magnitude of each preference, by counting how many times is seen in the data and subtracting how many times is seen. In order to simplify it, items could be assigned a numerical ID and these overall pairwise preferences only be calculated for the items where , or the other way around. If there are items, then there would be such possible pairs. This would reduce all the data to simply a small hashtable (with keys being tuples ) or a 1-D array, with space requirement, without losing any information. For the array case, since it is intended to be full and the required space is small anyway, it is possible to efficiently know at which position to insert and look for a pair with the “triangular-matrix” method as described in Leskovec, Rajaraman and Ullman (2014), where the entry is stored in with . Thus, the overall magnitude of would be available in the entry if (otherwise it would be in entry ) and the overall magnitude of would be the same but with opposite sign. Both of these could then be checked in time. Let's call this structure the preferences table.
The problem then becomes about constructing an ordering given weighted, unique pairwise preferences. This problem was studied in Ailon, Charikar and Newman (2008) for aggregating rankings, and in Shapire and Singer (1998). It was found to be NP-complete and some approximation algorithms were proposed, namely, Kwik-Sort and Greedy-Order, respectively, which will be implemented here for comparison purposes.
Whereas many studies have tried to conceive of the collection of pairwise preferences as a directed graph and then follow paths to assign numbers (such as in Schulze (2011), Ackerman and Chen (2011) and Shapire and Singer (1998)), this work will follow a different approach by iteratively improving a given ordering instead - either an initial guess or a random permutation - as such approaches presented more satisfactory results, and algorithms will be compared according to how they perform on the collected preferences rather than on their mathematical minimum guarantees of optimality.
In order to do this, let's define the score of an ordering as the number of preferences it satisfies minus the number of preferences it violates. For any given permutation of the items, it is possible to calculate its score by summing up the entries in the previously described preferences table corresponding to each pair of preferences implied by the ordering, by taking the ordering, generating all the pairwise preferences that it would imply (pairs of items where one is ranked above the other) and summing up the entries from the table for that preference - thus, it implies summing up numbers. However, given just one pair of items it is also possible to calculate the net effect on the score that this particular pair has when being in their current positions. Since each item can only have a preference relationship with other items, one pair of items will only have preference relationships with other items. If we want to compare how much would the score of an ordering change by swapping one pair of items, then it could be calculated as N - thus, it would only imply summing entries, and since it's not necessary to know the original score to check whether one is better than the other, this provides a nice base to try to improve orderings by thoughtfully swapping one pair at a time. In order to visualize how to do this, it could be considered that, given any pair of items, we could do a linear scan over the ranked list and divide the remaining items in three parts where preferences are the same with respect to the two items being evaluated:
A brute-force search would imply examining permutations and calculating for each one its full score, thus it would be , making it infeasible if there are more than a few couple of items. In addition to Kwik-Sort and Greedy-Order, the following algorithms that start from an initial ordering were ideated:
Min-Conflict for weighted lists - : for each possible pair of items in the list, calculate the net effect of having it as it is vs. having it reversed, then swap the pair that brought the highest score improvement, and repeat until no further improvement is possible
Random swaps - : pick a random pair of items and swap them if it improves the score. Repeat for a predefined number of times. Do this again from the beginning several times and output the ordering with the highest score.
Metropolis-Hastings swapping - : pick a random pair of items. If swapping them improves the score, swap them. If swapping the pair does not improve the score, swap them with a small probability inversely proportional to the score decrease. Output the ordering with the highest score improvement seen.
The pseudo-code for all of them can be found in the next subsection.
Since only items that were clicked can be preferred to other items, it is better to first separate the list into clicked and unclicked items, rank the clicked items only, and then append a randomized permutation of the unclicked items at the bottom of this ranked list - this makes the algorithms run faster and in some cases leads to slightly higher scores, and also avoids having unclicked items ranked above clicked items.
From these and the previously mentioned algorithms, Kwik-Sort and Min-Conflict are the only ones guaranteed to always put a potential Condorcet winner in the first place. Random swaps and Metropolis-Hastings swapping would do so with a probability proportional to the number of iterations, but as mentioned earlier, this might not be such an important factor and the comparisons will be based on the previously-defined scores obtained.
It should be mentioned again that, if taking this approach for ranking in the long run, data collected as described in section #2 cannot confirm the current order entirely, and if the perfect order is reached, data would still suggest other orderings. Thus, as mentioned, it would be necessary to use a randomized exploration algorithm to gather preferences that would not be seen otherwise, and maybe some form of online ranking algorithm would be preferred.
The next subsections will compare the performance of the proposed algorithms and then see which orderings would be obtained when assigning different weights to the preferences as described in section 2.
Intuitively, the list of items sorted by clicks should provide a somewhat good ordering under the previously-described preference-generating rules, and it is very easy and fast to just sort the items by clicks. This ordering was taken as a starting point for the algorithms and as a benchmark.
The algorithms were run with implicitly deduced preferences as described in section 2, but they would work just as fine - perhaps even better, at the expense of fewer data - with explicitly-stated user preferences. Due to the complexity of gathering such explicit data in reasonable quantities with these same items, this option was not explored.
Data description: the preferences were taken from the webpage of a retailer specialized in selling fashionable clothes, from a section of the catalog containing colorable books for children, using click and rank data collected during a certain time, generating preferences as each clicked item being preferred to the unclicked items ranked above it. This section of the catalog contains 39 items and there were 37 recorded browsing sessions with at least 1 click, totaling 72 clicks.
Here, Kwik-Sort and Metropolis-Hastings produced a tie. When trying it with data from other webpage sections, in cases when there are more preferences or more items, Metropolis-Hastings turns out to be the best performing algorithm in each case as long as it is run for enough iterations (50,000 always seem to be enough, even when there are 100+ items), and in some cases the difference between the score from Metropolis-Hastings and Kwik-Sort turned out to be relatively large, but the differences between the scores from each algorithm were variable and in some cases those from Greedy-Order were comparable to other algorithms. As well, depending on the random seed, sometimes some algorithms do better starting with the list sorted by clicks, sometimes with a random ordering. Overall, Kwik-Sort was the fastest to execute if doing a single run and not evaluating the score obtained, but running it for 1,000 trials instead of 10 barely raises the score. Random swaps takes around 300 iterations to surpass the score obtained by Greedy-Order, and around 700-1500 to equate that of Min-conflict, but going for higher scores requires much more iterations. Min-conflict was significantly sped-up by starting with the list sorted by some criteria, as it takes fewer iterations to reach an optimum. No algorithm or random seed saw a score higher than 759 for this data, and this might well be the true optimum, but due to the number of items, it cannot be checked by a brute-force search.
The maximum scores from Kwik-Sort, however, are significantly deteriorated if the unclicked items are included in the list, in which case it only reaches a score of 741 after 1,000 iterations, whereas the scores for random swaps and Metropolis-Hastings remain pretty much the same as long as they are run for tens of thousands of iterations. The score for Min-Conflict is not affected by including or excluding the unclicked items, but runs faster without them.
The running times vary a lot depending on the number of items to sort and the numbers here should not be taken as a definitive comparison. For example, Greedy-Order performs a lot slower than Kwik-Sort for larger problems, and Metropolis-Hastings can be a lot faster than Min-Conflict if there are more items.
For comparison purposes, if just trying random orderings, the scores obtained seem to follow a somewhat normal distribution - it seems to have some teeth because scores laying in certain ranges are not possible, and so certain numbers are more concentrated, but this effect diminishes when there are more items. The highest ones do not come close to those obtained from these algorithms:
Thus, they seem to be working well for the problem. However, they were meant for lists of a few dozen items and only Kwik-Sort, Random swaps and Metropolis-Hastings swapping could scale to larger problems. A different approach for ordering larger catalogs will be explored in the next section.
It would be tempting to try to apply some cross-validation holding out part of the data, but since this is an optimization problem, the results would not be comparable to those obtained by an algorithm run on the full data. To illustrate, suppose we have 3 preference pairs that end up in 3 different folds:
Here, if each ordering ends up satisfying the constraints in each fold, then we will have a score of +3, but an ordering could only achieve a score of +1 for this pair for the whole dataset. Likewise, if we have 4 preferences in different folds:
And for each fold the ordering does not satisfy each constraint, then we will have a score of -4, but this pair cannot produce any score at all when using the whole dataset.
Assumptions, weights and bias in implicit preferences
It is interesting to see what happens if preferences are deduced under different assumptions, and if they are weighted. Clicks at lower ranks are rarer, and we would expect more clicks at higher ranks, so maybe clicks should be weighted by the rank at which they happen. However, it might not be a good idea to just rescale according to the number of clicks seen at each rank, since this also reflects the attractiveness of each item if orderings are not altered regularly (the most likely case). A better idea could be to smooth the number of clicks by fitting a line via the least-squares method, and divide the weight of each preference by the estimated clicks at the rank the click happens. In this catalog section, the values are quite close for all ranks, so there does not seem to be much of a presentation bias - in other words, people do not blindly click the first thing they are shown like in search engines. The result is illustrated below:
As well, it might not be the case that users really compare items that are too far apart from each other in the list, as human memory and attention span is limited, and it might be the case that when people browse deep down the list they forget what items were at the top. As such, it might be desirable to weight the preferences lower the further apart two items are in the list. An exponential decay function was chosen for these purposes, as it has the property of decreasing the weight by the same proportion with each difference in rank.
And it might be the case that a lot of people also compare a clicked item to the item that is next but not clicked. The ordering obtained by optimizing under these preference-generating assumptions and weights will be compared to the orderings obtained by:
· Generating preferences as in Joachims (2002) (from here on referred as Joachims): a clicked item is preferred to all the unclicked items that were ranked above it.
· Generating preferences as in Agichtein et al. (2006) (here referred to as Agichtein): a clicked item is preferred to all the unclicked items that were ranked above it AND to the unclicked item ranked immediately below.
· Number of click after de-biasing (like in Jung, Hong and Kim (2005), here referred to as De-biased clicks): for every click that an item receives, take the inverse of the expected number of clicks at the rank where it happened, obtained by the least-squares method as previously illustrated, and rank items by their decreasing sum of these quantities.
· Just as a comparison point, by generating a preference only for the unclicked item ranked immediately above a clicked item - let's call this last one Pref. last.
Two ways of comparing the orders are by evaluating the number of clicks that the element at each rank received, and the sum of the ranks of the clicks that each element received (e.g. an item that was clicked at position #1 and then at position #2 gets a 3). The similarity will also be measured by Kendall's Tau-b (a variation of Kendall's Tau that can deal with ties).
Since they are all ranking the very same items, it is also possible to compare them according to the fraction of pairs of items that are ranked in the same relative order to each other (one above the other) - this could be converted to a correlation if subtracting the number of pairs that are in inverse relative order - and by their symmetric AP correlation coefficient as described in Yilmaz, Aslam and Robertson (2008), which is a similar measure but weighting the upper ranks higher. One problem with this approach is that they cannot deal with ties, but they can still provide a point of comparison between two orderings of unique elements. The unclicked items will be removed for the numerical correlations.
Some interesting conclusions can be drawn from these results:
· Trying to de-bias the items by the rank of the clicks to then rank them by de-biased clicks did not make any difference with respect to just ranking them by clicks.
· The ordering obtained from weighted preferences turned out, surprisingly, to be more similar to the one obtained from generating preferences as Joachims than as Agichtein, but in terms of assigning more importance to the ranks of the clicks, it was more similar to Agichtein.
· The weighted preferences seemed to penalize the first-ranked item a lot (in this case it was the one with 6 clicks), and the rank of the item with the most clicks gets lower if the decaying ratio is increased.
· Optimizing the weighted preferences problem seems to be harder.
· Preferences as Agichtein seemed to provide a more balanced order, taking into account both clicks and ranks of the clicks, without penalizing the first ranks too much.
...Подобные документы
Data mining, developmental history of data mining and knowledge discovery. Technological elements and methods of data mining. Steps in knowledge discovery. Change and deviation detection. Related disciplines, information retrieval and text extraction.
доклад [25,3 K], добавлен 16.06.2012A database is a store where information is kept in an organized way. Data structures consist of pointers, strings, arrays, stacks, static and dynamic data structures. A list is a set of data items stored in some order. Methods of construction of a trees.
топик [19,0 K], добавлен 29.06.2009Проблемы оценки клиентской базы. Big Data, направления использования. Организация корпоративного хранилища данных. ER-модель для сайта оценки книг на РСУБД DB2. Облачные технологии, поддерживающие рост рынка Big Data в информационных технологиях.
презентация [3,9 M], добавлен 17.02.2016Классификация задач DataMining. Создание отчетов и итогов. Возможности Data Miner в Statistica. Задача классификации, кластеризации и регрессии. Средства анализа Statistica Data Miner. Суть задачи поиск ассоциативных правил. Анализ предикторов выживания.
курсовая работа [3,2 M], добавлен 19.05.2011Web Forum - class of applications for communication site visitors. Planning of such database that to contain all information about an user is the name, last name, address, number of reports and their content, information about an user and his friends.
отчет по практике [1,4 M], добавлен 19.03.2014Описание функциональных возможностей технологии Data Mining как процессов обнаружения неизвестных данных. Изучение систем вывода ассоциативных правил и механизмов нейросетевых алгоритмов. Описание алгоритмов кластеризации и сфер применения Data Mining.
контрольная работа [208,4 K], добавлен 14.06.2013Совершенствование технологий записи и хранения данных. Специфика современных требований к переработке информационных данных. Концепция шаблонов, отражающих фрагменты многоаспектных взаимоотношений в данных в основе современной технологии Data Mining.
контрольная работа [565,6 K], добавлен 02.09.2010Основы для проведения кластеризации. Использование Data Mining как способа "обнаружения знаний в базах данных". Выбор алгоритмов кластеризации. Получение данных из хранилища базы данных дистанционного практикума. Кластеризация студентов и задач.
курсовая работа [728,4 K], добавлен 10.07.2017Історія виникнення комерційних додатків для комп'ютеризації повсякденних ділових операцій. Загальні відомості про сховища даних, їх основні характеристики. Класифікація сховищ інформації, компоненти їх архітектури, технології та засоби використання.
реферат [373,9 K], добавлен 10.09.2014Значение атрибута TITLE тега HTML-документа. Возможности HTML для разработчиков Web-страниц. Параметры тега , регулирующие отступы вокруг изображения. Оформление комментариев в CSS. Теги логического форматирования текста (phrase elements).
тест [19,9 K], добавлен 11.10.2012Особенности работы с графическими изображениями Java Script. Способы динамического управления слоями. Рассмотрение примеров использования операторов цикла. Характеристика свойств объекта form: encoding, elements, checkbox. Возможности документов HTML.
курсовая работа [167,7 K], добавлен 09.02.2013Роль информации в мире. Теоретические основы анализа Big Data. Задачи, решаемые методами Data Mining. Выбор способа кластеризации и деления объектов на группы. Выявление однородных по местоположению точек. Построение магического квадранта провайдеров.
дипломная работа [2,5 M], добавлен 01.07.2017Определение программы управления корпоративными данными, ее цели и предпосылки внедрения. Обеспечение качества данных. Использование аналитических инструментов на базе технологий Big Data и Smart Data. Фреймворк управления корпоративными данными.
курсовая работа [913,0 K], добавлен 24.08.2017Анализ проблем, возникающих при применении методов и алгоритмов кластеризации. Основные алгоритмы разбиения на кластеры. Программа RapidMiner как среда для машинного обучения и анализа данных. Оценка качества кластеризации с помощью методов Data Mining.
курсовая работа [3,9 M], добавлен 22.10.2012Методика и основные этапы построения модели бизнес-процессов верхнего уровня исследуемого предприятия, его организационной структуры, классификатора. Разработка модели бизнес-процесса в IDEF0 и в нотации процедуры, применением Erwin Data Modeler.
курсовая работа [1,6 M], добавлен 01.12.2013Изучение возможностей AllFusion ERwin Data Modeler и проектирование реляционной базы данных (БД) "Санатория" на основе методологии IDEF1x. Определение предметной области, основных сущностей базы, их первичных ключей и атрибутов и связи между ними.
лабораторная работа [197,5 K], добавлен 10.11.2009Перспективные направления анализа данных: анализ текстовой информации, интеллектуальный анализ данных. Анализ структурированной информации, хранящейся в базах данных. Процесс анализа текстовых документов. Особенности предварительной обработки данных.
реферат [443,2 K], добавлен 13.02.2014Характеристика та класифікація CASE-засобів, технологія їх впровадження. Структура і функції CASE-засобу Silverrun. Переваги, результати застосування та ключові функції CA ERwin Data Modeler. Проектування роботи інтернет-магазину за допомогою UML-діаграм.
курсовая работа [1,5 M], добавлен 07.02.2016Общее понятие о системе Earth Resources Data Analysis System. Расчет матрицы преобразования космоснимка оврага. Инструменты геометрической коррекции, трансформирование. Создание векторных слоев. Оцифрованные классы объектов. Процесс подключения скрипта.
курсовая работа [4,3 M], добавлен 17.12.2013Управление электронным обучением. Технологии электронного обучения e-Learning. Программное обеспечение для создания e-Learning решений. Компоненты LMS на примере IBM Lotus Learning Management System и Moodle. Разработка учебных курсов в системе Moodle.
курсовая работа [146,6 K], добавлен 11.06.2009