Funds of the library
A study of the main features of recommender systems, including knowledge-based recommendation. An overview of the advantages and disadvantages of the English Wikipedia model, which is the most appropriate model for the topic recommender.
Category | Programming, computers and cybernetics |
Type | coursework |
Language | English |
Date added | 25.06.2017 |
Contents
Introduction
1. Literature overview
1.1 The reasons for implementing a RS
1.2 Types of RS
1.3 Non-personalized recommendations
1.4 Basic structure of a content-based RS
1.5 Advantages and drawbacks
1.6 Advanced methods: Ontologies
1.7 Evaluation of RS
2. Methods
2.1 Graph
2.2 VSM
2.3 WordNet
2.4 Public data sets: why they are not used in recommendations
2.5 Weights of recommender sources
3. Results
3.1 Interfaces
3.2 Clusters
Conclusion
References
Introduction
Recommender systems are widely used nowadays. Many Internet users have come across them while shopping online: for instance, such a system is built into the website of the large retailer Amazon. Avito, one of the biggest Internet bulletin boards in Russia, is currently holding a contest for the best recommender system. Although online shopping websites are a good example, recommender systems can be applied in many other areas, from news articles to songs and films.
The New York Public Library provides an enormous collection of digital items, such as images, maps, sound recordings or videos. In total there are 707,378 items, some of which are tagged with key words. It would be interesting to explore the system of key words, as the collection is rather rich and covers a great number of topics. First, I would like to build a recommender system for digital items using their tags. Next, taking into account the output of the recommender system and the co-occurrence of tags, I will determine how closely the tags are related to each other. Their similarity can then be visualized with a graph or with a semantic map, which will represent the whole system of key words.
The New York Public Library Digital Collections do not use a recommendation system, although their data is divided into sub-collections with the help of additional information about the items. The rather simplified recommendation strategy that I will be using can subsequently be improved and implemented in the website of the Digital Collections. Moreover, after comparing the system of key words to other semantic representations, I will determine whether it is possible to build a semantically reliable tag system using exclusively information about tag co-occurrence.
The New York Public Library Digital Collections have an API, which can be used to retrieve digital items and their data. It is possible to access the API from a program, for example a Python script. The content of the items is described using the MODS XML standard, which makes it easy to extract, for example, an item's genre or topic. Python 3 and packages for NLP, machine learning and visualization should be sufficient to analyze the data and accomplish my goals.
The paper is structured as follows. Section 1 provides a literature overview of the existing recommender system types and various algorithms used for building recommendation systems. Section 2 discusses the methods I applied during the research and compares the results of different methods. Finally, Section 3 summarizes the outcomes of the research.
1. Literature overview
1.1 The reasons for implementing a RS
There are plenty of reasons for implementing a recommender system in a database, both for providers and for users. The effect of a recommender system that appeals to providers most is probably an increase in the number of items that a user consumes. For instance, Amazon experienced a 29% growth in sales after introducing a recommender system. A good recommender system also provides a diverse set of suggestions. Therefore, a user may receive information about a suitable item that they would not have found without a recommender. Moreover, recommender systems help to cope with the information overload problem, which has become so urgent in recent years. The range of choice may be too great for a user and can lead to poor decisions if no recommendations are provided. All these factors lead to a better correspondence between users' intentions and the results provided by the system, and consequently to an increase in users' satisfaction.
1.2 Types of RS
According to (Jannach et al. 2011), there are three main types of recommender systems, excluding the hybrid type, which stands for a combination of two or more approaches. The best-known strategy is collaborative recommendation, which takes into account solely the profiles of users. This approach assumes that users who liked the same items in the past will also have similar preferences in the future. A collaborative recommender system therefore suggests items that the user has not yet chosen but that were liked by another user with similar tastes. As the collaborative approach is based on user profiles, it does not require any information about item characteristics, but it cannot recommend new items that have not yet been rated by users.
Another type of recommender system is content-based, which makes predictions about items a user might like by comparing item descriptions with a user profile. In these systems, information about user profiles can be gained either by implicitly tracking users' actions or by explicitly asking questions about the items the user is interested in. Item descriptions can be technical or qualitative. Technical descriptions, such as the main key words of a news article, are usually easy to extract automatically. Qualitative descriptions, however, tend to contain manually entered information, which is often costly and highly time-consuming to produce. An example of this approach is the “Music Genome Project”, where every song is manually labelled with up to several hundred features. As a result, the annotation of a song takes from 20 to 30 minutes on average (Jannach et al. 2011).
Unlike collaborative recommender systems, content-based recommendation does not depend on a history of users' actions or on large user groups. Therefore, this approach has no problem with new items, which can be recommended immediately on the basis of their features.
The last type of recommender system is knowledge-based recommendation. It resembles content-based recommendation, as it also takes into consideration item descriptions and user profiles. In knowledge-based recommenders, however, interaction between a user and the system plays a greater role, for instance through questions or an interactive dialog. The reason for that is the large number of one-time buyers in such systems, which can be the case for electronics stores. In this way, the system acquires more detailed and specific information about user preferences and matches it with manually provided item features.
I believe that in the case of the NYPL collections, matching item annotations with the exact key words the user typed does not always give the information that the user is looking for. A few techniques can be used to broaden the range of topics the user receives for a query. For this purpose, it is enough to take into account only the current query and to disregard the personal preferences of the user. Hence, collaborative and knowledge-based approaches do not suit the case very well. That is why I will be creating a slightly modified type of content-based recommender system.
1.3 Non-personalized recommendations
There is a popular view that recommender systems are only used to provide personalized suggestions, because non-personalized recommendations, such as top-ten selections of books, are too general and easy to generate (Ricci et al. 2011). In my opinion, in some cases using a search query as the only source of information about user preferences can also yield reliable and satisfactory recommendations. However, a certain degree of item diversity is needed: for example, this strategy will not work for a bookshop, but may apply in a clothes store, when recommending that a user buy an umbrella along with the rubber boots they are looking for. As for the NYPL digital collections, this degree of diversity is guaranteed by the number of various topics and the different fields these topics cover.
1.4 Basic structure of a content-based RS
A basic content-based recommender works in three to four stages (Ricci et al. 2011). The first stage is content analysis, in which the initial item descriptions are processed and transformed into structured item representations. Next, the recommender system learns profiles, that is, it collects user preferences, which, as already mentioned, can be obtained either implicitly or explicitly, and generalizes this data so that profiles are easier to compare with item representations. The third stage is filtering, in which the system looks for the items that match the profile representation best. Lastly, user feedback is often involved in the recommendation process. Like user interests, it can be collected explicitly or implicitly. The feedback is then added to the user profile and is taken into account during subsequent recommendations.
1.5 Advantages and drawbacks
Content-based recommender systems have a few advantages worth mentioning (Ricci et al. 2011). Firstly, this type of recommendation is not influenced by the ratings provided by other users, which play a great role in collaborative recommendation. Only the ratings of the active user are taken into consideration, which can lead to more successful suggestions.
Another important factor is the transparency of content-based recommenders. In order to make the recommendations of the system more trustworthy for users, it is often sensible to explain why a certain item has been recommended. Using a content-based approach, the system may accompany its recommendations with the features shared by an item and the user profile, thus increasing the user's trust in the system. A collaborative recommender, on the other hand, cannot provide any clear reasons for its recommendations, because it relies only on the preferences of unknown users with similar tastes.
The main drawbacks of content-based recommenders include, for instance, limited content analysis (Ricci et al. 2011). This means that the system cannot process an infinite number of features, and the types of characteristics it uses are also limited. Therefore, some items that would have been suitable for a user are ignored.
Another negative aspect of this type of recommender is over-specialization. A content-based system can provide only recommendations similar to the items the user liked before, and has no opportunity to suggest anything unexpected.
Moreover, while collaborative systems have problems with recommending new items to their users, content-based recommenders need some time before they can recommend anything to a new user. This shortcoming may be partly overcome by asking some initial questions about a user's preferences.
1.6 Advanced methods: Ontologies
When the items are provided in the form of text, rather good recommender systems can be built by extracting key words from texts and/or user profiles. However, there are certain strategies that help this simple approach to acquire the “semantic intelligence” it lacks and to achieve better results (Ricci et al. 2011).
Using ontologies in a content-based recommender supplies the system with additional cultural and linguistic information about items and profiles (Ricci et al. 2011). As a result, the processing of items and profiles starts to resemble the natural interpretation of language documents. WordNet is mainly used in recommender systems for word sense disambiguation and key-word support. For instance, this is the case for SEWeP (Semantic Enhancement for Web Personalization) (Ricci et al. 2011). Apart from WordNet, it uses its own domain-specific ontology for the semantic annotation of webpages. It is not the only recommender system that employs its own specially designed ontologies. Quickstep can serve as one more example (Ricci et al. 2011). This recommender uses its research paper topic ontology to enhance its suggestions of academic papers. The authors of (Ricci et al. 2011) state that all the systems that exploit linguistic and/or domain-specific knowledge improve the quality of their recommendations.
1.7 Evaluation of RS
Before the evaluation of a recommender system, it is necessary to define which variables are important for providing recommendations (Jannach et al. 2011). If a variable has a constant value throughout the whole recommendation process or its value is controlled in the system design (e.g. if a variable is a recommendation algorithm), it is considered independent. Dependent variables are supposed to be influenced by independent ones, and the goal of evaluation is usually to determine the impact of independent variables.
A few types of evaluation experiments can be distinguished. Depending on the evaluation settings, there are lab studies and field studies (Jannach et al. 2011). It is claimed that people in lab studies are less motivated, and there is a chance that they would act differently in a real-world situation. Field studies, however, have disadvantages too, as there are too many factors involved, and researchers have no control over them.
The evaluation of a recommender system can also be done at different stages of its development (Ricci et al. 2011). In the design phase, an evaluation experiment is usually conducted on historical user sessions. The main purpose of the experiment at this stage is to check how suitable the chosen approaches are for a given data set. When the recommender system is ready, it is evaluated once more, in order to make slight improvements or adjust independent variables. This type of evaluation is more likely to be conducted online or, when an online experiment is too risky, through a focused study on a small group of users. A clear advantage of a focused study is the possibility to collect both quantitative and qualitative information about system performance.
2. Methods
The NYPL Digital Collections have their own API, which, among other ways, can be accessed from a program such as a Python script. This makes the process of downloading information about items simple and convenient.
The contents of the collections are being modified all the time. At the time the items' topics were downloaded, there were 765,395 items, about 52% of which were annotated with one or more topics.
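Since the items are described in MODS XML, extracting the topics of one item can be sketched as follows. This is a minimal illustration rather than the code used in the project: the sample record is hand-written, the function names are hypothetical, and the way a record is obtained from the API is left out.

```python
import xml.etree.ElementTree as ET

# Namespace of MODS version 3 documents.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical, hand-written record used only for illustration.
SAMPLE_MODS = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Evening dress</title></titleInfo>
  <genre>Fashion plates</genre>
  <subject><topic>Evening wear -- 1920-1929</topic></subject>
  <subject><topic>Dresses</topic></subject>
</mods>
"""


def extract_topics(mods_xml):
    """Return the text of every <subject>/<topic> element of a MODS record."""
    root = ET.fromstring(mods_xml.strip())
    topics = []
    for topic in root.findall(".//mods:subject/mods:topic", MODS_NS):
        if topic.text and topic.text.strip():
            topics.append(topic.text.strip())
    return topics


def extract_genres(mods_xml):
    """Return the text of every <genre> element of a MODS record."""
    root = ET.fromstring(mods_xml.strip())
    return [g.text.strip() for g in root.findall(".//mods:genre", MODS_NS) if g.text]


item_topics = extract_topics(SAMPLE_MODS)
item_genres = extract_genres(SAMPLE_MODS)
```

Items without any `<topic>` element simply yield an empty list, which corresponds to the items that are not annotated with topics.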
2.1 Graph
There are 21,549 unique topics in the collections, and only about 3% of them (539 topics) never occur together with other topics in the description of an item. That is why building a graph with weighted edges appears to be a good option for establishing connections between topics and learning which ones are closely related. The weight of an edge depends on the number of co-occurrences of the two topics.
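The construction of such a graph can be sketched as follows. The sketch assumes that the weight of an edge is simply the number of items in which both topics occur; the function name and the toy data are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(items):
    """Build an undirected weighted graph from lists of topics.

    `items` is an iterable of topic lists, one list per item.
    Returns {topic: {neighbor: weight}}, where the weight is the number
    of items in which both topics occur. Topics that never co-occur with
    another topic are kept as isolated nodes without neighbors.
    """
    edge_weights = Counter()
    nodes = set()
    for topics in items:
        unique = sorted(set(topics))
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            edge_weights[(a, b)] += 1

    graph = {node: {} for node in nodes}
    for (a, b), weight in edge_weights.items():
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


# Toy data, invented for illustration.
toy_items = [
    ["dresses", "evening wear"],
    ["dresses", "evening wear", "hats"],
    ["maps"],
]
toy_graph = build_cooccurrence_graph(toy_items)
isolated = [topic for topic, neighbors in toy_graph.items() if not neighbors]
```

In the toy data, "maps" is an isolated node, which mirrors the small share of topics that never occur together with others.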
After establishing connections of different strength between the nodes, a number of the most similar neighbors of an item can be extracted. First, however, it is necessary to find the items that match the user's query exactly. Therefore, both the topics and the user's query should be preprocessed. Preprocessing includes tokenization and stop-word removal with the NLTK package for Python, deletion of punctuation marks, and lemmatization. In my opinion, the most suitable tool for lemmatizing English texts is TreeTagger. Surprisingly, the lemmatizer provided by NLTK needs a part-of-speech tag for each word but cannot assign the tags itself. The only way to use this lemmatizer is to perform tagging with other NLTK tools, although the tag set expected by the lemmatizer differs significantly from the one produced by the NLTK POS-tagger.
All the topics are converted to lower case. Most of the topics start with a capital letter, which could cause mismatches in many situations. I believe that recall is more crucial than precision for my purpose. Moreover, the amount of data is limited, and losing matching topics because of a difference in case could have a considerable negative effect on the results of the recommender.
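The preprocessing steps can be illustrated with a simplified sketch. It is not the pipeline described above: tokenization is reduced to splitting on whitespace, the tiny stop-word list is a stand-in for the NLTK list, and lemmatization (done with TreeTagger in the project) is left out entirely.

```python
import string

# A tiny illustrative stop-word list; a stand-in for the NLTK list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

# Map every punctuation mark to a space, so that "1920-1929" is split in two.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop punctuation and stop-words, return tokens."""
    tokens = text.translate(PUNCTUATION_TABLE).lower().split()
    return [token for token in tokens if token not in stop_words]


topic_tokens = preprocess("Evening wear -- 1920-1929")
query_tokens = preprocess("The History of New York.")
```

Applying the same function to topics and queries keeps the two comparable, which is the point of lower-casing both.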
After preprocessing the query, the program retrieves all the topics that contain the required word or phrase. However, there is no need to list these topics, as the website of the NYPL Digital Collections provides good search facilities. The recommender should suggest the items that a user would not find using only the NYPL search. For that reason, all the graph neighbors of the retrieved topics are collected and sorted according to the weights of the edges. The program takes into account only the neighbors that do not match the user's initial query: a neighbor that matches the query is of no interest as a recommendation, since it would have been found on the NYPL website anyway. Thus, the topics with the highest edge weights are the most closely related to the user's query, as they occur most often together with the topics that match the user's interests.
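A minimal sketch of this step is given below. Two details are assumptions made for the illustration and are not stated in the text: a topic matches the query when it contains it as a substring, and a neighbor reached from several matching topics keeps the largest of its edge weights. The names and the toy graph are hypothetical.

```python
def recommend_from_graph(query, graph, limit=10):
    """Rank the neighbors of the topics that contain the query.

    `graph` maps each (preprocessed, lower-cased) topic to a dict
    {neighbor: weight}. Neighbors that themselves contain the query are
    skipped, because a plain search would already find them.
    """
    query = query.lower()
    matched = [topic for topic in graph if query in topic]
    scores = {}
    for topic in matched:
        for neighbor, weight in graph[topic].items():
            if query in neighbor:
                continue
            # Assumption: keep the strongest connection to any matched topic.
            scores[neighbor] = max(scores.get(neighbor, 0), weight)
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [topic for topic, _ in ranked[:limit]]


# Toy graph, invented for illustration.
toy_graph = {
    "evening wear": {"dresses": 5, "hats": 2, "evening gowns": 4},
    "evening gowns": {"evening wear": 4, "dresses": 3},
    "dresses": {"evening wear": 5, "evening gowns": 3},
    "hats": {"evening wear": 2},
    "maps": {},
}
graph_recommendations = recommend_from_graph("evening", toy_graph)
```

For the query "evening", both matching topics are left out of the result and only their non-matching neighbors are returned, the most strongly connected first.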
Nevertheless, the graph can be a source of recommendations only in a limited number of cases, namely when some topics contain the word or phrase the user is looking for. Since there are some isolated nodes in the graph, these topics must also co-occur with other topics at least once. When the query does not meet these conditions, alternative recommendation approaches should be provided.
2.2 VSM
One of the possible alternatives is using vector space models to determine nearest semantic associates for the user's query.
Vector space models (VSMs) are based on the statistical semantics hypothesis, which states that statistical information about word usage can reflect the meaning people intend to convey. There are three main types of VSMs, each based on a different kind of matrix. First, there are models using term-document matrices. They rely on an implication of the statistical semantics hypothesis called the bag of words hypothesis. It claims that the similarity between a query and a document can be determined by transforming them into bags of words. Document similarity is very often used in search engines, when it is necessary to find the websites most closely semantically related to the search query (Pantel, Turney 2010).
Another type of VSM is based on word-context matrices. The underlying hypothesis is called distributional and states that words used in similar contexts also have similar semantics. These models are well suited for measuring how close words are to each other semantically, which is usually done by calculating the cosine similarity between the two vectors representing the words. There are many applications for word-similarity VSMs, for instance word sense disambiguation or semantic role labelling (Pantel, Turney 2010).
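The cosine similarity of two vectors u and v is their dot product divided by the product of their lengths. A short sketch on invented word-context counts is given below; the vectors are toy data, not taken from any of the models discussed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Invented co-occurrence counts with the contexts "sail", "water", "read".
boat = [4, 6, 0]
ship = [2, 3, 0]
book = [0, 0, 5]

similarity_boat_ship = cosine_similarity(boat, ship)
similarity_boat_book = cosine_similarity(boat, book)
```

Vectors that point in the same direction, like the toy vectors for boat and ship, have a similarity of 1 regardless of their length, while vectors with no shared contexts have a similarity of 0.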
The third type of matrix is the pair-pattern matrix, which is used to represent the similarity of relations between pairs of words. Two hypotheses are useful for these models. Firstly, there is the extended distributional hypothesis, which argues that patterns co-occurring with the same pairs of words tend to have similar meanings, such as "a dog bit a boy" and "a boy was bitten by a dog". Secondly, the latent relation hypothesis is applied. It supposes that pairs of words have similar semantic relations if they are used with the same patterns, for instance human : hand and animal : paw. Models of relational similarity are employed, for example, for dividing words into semantic classes (Pantel, Turney 2010).
All in all, VSMs give good results when it is necessary to find a similar word, phrase or document (Pantel, Turney 2010). My main purpose is to expand a search query, so that more recommendations can be provided. Therefore, the VSMs based on word-context matrices suit best. Such popular search engines as Google and Yahoo! also use this approach for generating new terms that are closely connected to a search query (Pantel, Turney 2010).
The WebVectors project offers a few trained VSMs for English and Norwegian, as well as a handy interface for accessing them. In addition, every model can be downloaded and used locally, for example with the Gensim package for Python. There are four English models to choose from, which differ in size, training parameters and document sources. The biggest model is the Google News Corpus model, containing about 2.9 million English words and phrases. However, the preprocessing for this model does not include lemmatization, so it is not very useful for lemmatized topics. Lemmatization is indeed crucial for the topic recommender, because, as I have already mentioned, the number of topics is relatively small.
The second biggest model is English Gigaword, with 314,815 different lemmas. Yet it has been trained on newswire text, which does not suit the format of the topics very well.
The most appropriate model for the topic recommender is, in my opinion, the English Wikipedia model. It contains 296,630 different lemmas, which is only slightly fewer than English Gigaword. The English Wikipedia model has a longer context window, and its performance on the Google Analogy test set is higher than that of all the other presented models.
Lastly, there is a model trained on the British National Corpus, but it is almost half the size of the English Wikipedia model, and its performance on the Google Analogy test set is considerably worse. Thus, the most suitable model for the topic recommender is the English Wikipedia model.
In order to extract the topics that are most relevant to the user's query, the program finds the ten words or phrases that are semantically closest to the search query. Then the program goes through these ten entities, from the semantically closest to the most distant, and looks for matches in the topics. As in the graph algorithm, the topics that contain the search query are ignored in order not to provide redundant information. The final result is a list of topics that can be recommended to the user, with the most relevant topics at the beginning and the less relevant ones at the end.
All the words in the English Wikipedia model are POS-tagged, and the simplest and most reliable way of looking up a query in the model is to go through all possible tags. I assume that a user is interested in all the possible query-tag combinations, and, again, prioritize recall over precision. The model relies on Universal POS tags, so the following tags are used in the recommender algorithm: ADJ, ADV, NOUN, PROPN, VERB. All the closed-class tags are excluded, mainly because many of the corresponding words are deleted from the topics at the preprocessing stage. Moreover, closed-class words, as well as interjections, are not typical of search queries. The remaining tags are also very unlikely to occur in a query.
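The expansion of a query with the VSM can be sketched as follows. Several points are assumptions made for the illustration: the keys of the model are taken to have the form `word_TAG`, the candidates found under different tags are merged and the ten most similar ones are kept, and a topic matches a candidate when it contains it as a substring. The function expects any object that supports `in` and a `most_similar(positive=[...], topn=...)` call returning (key, similarity) pairs, as Gensim's `KeyedVectors` does; `ToyModel` is a stand-in so that the sketch runs without a downloaded model.

```python
POS_TAGS = ("ADJ", "ADV", "NOUN", "PROPN", "VERB")


def expand_query(query, model, topics, topn=10, tags=POS_TAGS):
    """Return topics that contain words close to the query in the model."""
    query = query.lower()
    best = {}
    for tag in tags:
        key = f"{query}_{tag}"  # assumed key format: word_TAG
        if key not in model:
            continue
        for neighbor_key, similarity in model.most_similar(positive=[key], topn=topn):
            word = neighbor_key.rsplit("_", 1)[0].lower()
            if word != query:
                best[word] = max(best.get(word, 0.0), similarity)
    # Assumption: merge the candidates of all tags and keep the closest ones.
    ranked = sorted(best, key=lambda word: -best[word])[:topn]

    recommendations = []
    for word in ranked:
        for topic in topics:
            if word in topic and query not in topic and topic not in recommendations:
                recommendations.append(topic)
    return recommendations


class ToyModel:
    """Stand-in offering only the two operations used above; not a real VSM."""

    def __init__(self, neighbors):
        self._neighbors = neighbors

    def __contains__(self, key):
        return key in self._neighbors

    def most_similar(self, positive, topn=10):
        return self._neighbors[positive[0]][:topn]


# Invented neighbors and topics.
toy_model = ToyModel(
    {"ship_NOUN": [("boat_NOUN", 0.8), ("vessel_NOUN", 0.7), ("shipyard_NOUN", 0.6)]}
)
toy_topics = ["boats and boating", "ships", "sailing vessels", "shipyards"]
vsm_recommendations = expand_query("ship", toy_model, toy_topics)
```

In the toy example the topic "shipyards" is dropped even though "shipyard" is a close neighbor, because the topic contains the query itself and would be found by a plain search.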
To sum up, a VSM can be successfully used for extending a search query both when the query was found in the topics and when it was not. For the topic recommender I use the VSM trained on the texts of an English Wikipedia dump.
2.3 WordNet
Another useful tool that the topic recommender can benefit from is WordNet. WordNet is a lexical database in which synonymous words that belong to the same lexical category are combined into synonym sets (synsets). Polysemous words are represented by different synsets. As every synset is accompanied by its definition and some usage examples, WordNet provides great opportunities for word sense disambiguation. Indeed, it is widely used for this purpose: the right meaning is determined by comparing the context of the ambiguous word with the usage examples of the possible options.
There are only four lexical classes of words included in WordNet: nouns, verbs, adjectives and adverbs. No words of closed lexical classes are taken into account.
The synsets are connected to each other by relations, which differ depending on the lexical class. The relations I use in the topic recommender include synonyms (that is, the words belonging to the same synset), antonyms, hypernyms and hyponyms. Noun synsets can have all four relations. For example, for the word boy an antonym would be girl, one of the hypernyms of the synset the word belongs to would be male, and one of the many hyponyms of the synset would be scout. The same is true for verbs: the verb walk has the antonym ride, and the synset it belongs to (being its only member) has, among others, the hypernym travel and the hyponyms march and hike. There are also some other specific relations between noun synsets and between verb synsets, such as meronymy (X is a part of Y) or entailment (by doing X you must be doing Y), which are not implemented in the program because it would be too time-consuming.
For adjectives, only antonyms and semantically similar words are available: an antonym of ugly is, of course, beautiful, and similar terms include despicable and horrifying. The class of adverbs contains only a few words, because in most cases the relations between English adjectives stay the same when the words are turned into adverbs by a simple derivational process.
The algorithm that extracts similar topics using WordNet is therefore divided into four parts: searching for synonyms, antonyms, hypernyms and hyponyms. Each subsequent part is performed only if the number of recommendations is still under the threshold, because WordNet is more time-consuming than the previous methods used to produce recommendations. For this reason, the number of results is checked many times in the program. After being extracted from the synsets, the lemmas are searched for in the topics, and the successful matches are returned in the final list of WordNet recommendations.
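The staged search can be sketched as follows. To keep the example runnable without the WordNet data, the four lookups are passed in as plain functions and replaced here by invented toy tables; skipping the lemma that equals the query and matching lemmas as substrings of topics are assumptions made for the illustration.

```python
def staged_recommendations(query, topics, stages, threshold=10):
    """Run the lookups one after another until enough topics are found.

    `stages` is an ordered list of (name, lookup) pairs, where
    lookup(word) returns an iterable of lemmas related to the word.
    A stage is started only while the number of results is under the threshold.
    """
    query = query.lower()
    recommendations = []
    for _name, lookup in stages:
        if len(recommendations) >= threshold:
            break
        for lemma in lookup(query):
            lemma = lemma.replace("_", " ").lower()
            if lemma == query:
                continue  # the synset of the query contains the query itself
            for topic in topics:
                if lemma in topic and topic not in recommendations:
                    recommendations.append(topic)
    return recommendations[:threshold]


# With NLTK and its WordNet corpus installed, the lookups could be built from
# nltk.corpus.wordnet (imported as wn), for example:
#   synonyms  = lambda w: [l.name() for s in wn.synsets(w) for l in s.lemmas()]
#   hypernyms = lambda w: [l.name() for s in wn.synsets(w)
#                          for h in s.hypernyms() for l in h.lemmas()]

# Invented toy tables standing in for WordNet.
toy_relations = {
    "synonyms": {"boy": ["male_child", "boy"]},
    "antonyms": {"boy": ["girl"]},
    "hypernyms": {"boy": ["male"]},
    "hyponyms": {"boy": ["scout"]},
}
toy_stages = [
    (name, lambda word, table=table: table.get(word, []))
    for name, table in toy_relations.items()
]
toy_topics = ["girls", "boy scouts", "male figures", "scouts"]

all_matches = staged_recommendations("boy", toy_topics, toy_stages, threshold=10)
few_matches = staged_recommendations("boy", toy_topics, toy_stages, threshold=2)
```

With a threshold of 2 the hyponym stage is never started, which shows how checking the number of results between the stages saves the more expensive lookups.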
2.4 Public data sets: why they are not used in recommendations
Public data sets could also be considered as a possible source of information for recommendations. There are many options, as can be seen from the Linked Open Data project. The goal of this project is to connect identifiers of the same items from different public sources. By 2014, 570 data sets were linked to each other using OWL, a Semantic Web language for describing complex knowledge about things, groups of things and how they are related to each other (Sletten 2015).
One of the biggest and most diverse databases of human knowledge is Wikipedia. Though most information in Wikipedia consists of unstructured text, there are also other ways of representing information, for example infoboxes or categories. The DBpedia project collects different kinds of structured information in a database and uses the Resource Description Framework (RDF) to represent it. Among other sources, DBpedia is involved in the Linked Open Data project.
Another useful source of public data is Wikidata, which, unlike DBpedia, is constructed manually and can be edited by anyone, like Wikipedia itself. The information there consists mainly of items which have properties and corresponding values: for example, a person can be described by stating their place of birth or their profession.
Nevertheless, after exploring both tools and thinking of their possible applications to the topic recommender system, it was decided not to use them in the program. There are several reasons for that.
The first approach that was considered was to find an instance corresponding to a user's query, then establish the class of this instance and collect other instances of the class for recommendations. However, as far as I can see, classes tend to be enormous, for instance the class Human used in Wikidata or the smaller but still very big class Actor in DBpedia. There are smaller classes, too, such as the class Card Game in DBpedia, but bigger classes prevail. In addition, one instance may be related to more than one class. Thus, listing all or some instances of the class would most likely not provide users with high-quality suggestions.
One more idea for applying Wikipedia knowledge databases was to collect the instances that have characteristics similar to the user's query. The problem is that one instance usually has many characteristics, some of which could be relevant for us (for example, the university a person went to) and some not (cause of death). Moreover, users' queries can be very diverse: a user may ask about a person or a place, but also about a planet or a historical event. Unfortunately, it is not possible to establish the main characteristics for all the classes automatically.
Lastly, if a user's query is not found in the database as an instance but is the name of a class, the program could suggest looking for the topics that contain similar classes, the hypernym class and the hyponym classes. However, I believe that, as the WordNet taxonomy already performs this task, there is no need to implement an analogous approach.
To sum up, such knowledge databases as DBpedia and Wikidata provide a great opportunity for expanding the information about an item, but they are more applicable when the data is not too diverse and the exact classes of instances are determined.
2.5 Weights of recommender sources
The goal of a recommender system, like that of every search engine, is to provide a user with the most relevant information first. Therefore, as the topic recommender has three sources of recommendations, it is important to define which of them are more reliable and which are less. In order to evaluate the sources, 20 words from the vocabulary of the topics were chosen. The choice was random, with only a few restrictions applied.
Firstly, words with a frequency lower than the threshold were excluded in order to optimize graph recommendations. The threshold is the one used for the number of recommendations, so it is assumed here that all or some of the found nodes have at least one neighbor. The probability of finding an isolated node is rather low, considering that only 3% of the nodes in the graph are isolated. Therefore, even if some of the found nodes were isolated, the number of neighbors would still be very likely to reach the threshold. Without this restriction, the graph might have recommended too few nodes for its performance to be evaluated.
Secondly, only the words that had recommendations from the two other sources were taken into account. This is not the case for every word. Some words are not included in the vector space model based on Wikipedia, and some have few synonyms and other related words in WordNet. Of course, there can be situations when the VSM and WordNet produce no recommendations at all, although sometimes there are simply not enough of them to make an evaluation. Thus, only the words that received a number of recommendations equal or close to the threshold were considered.
After the recommendations for the words were collected, two assessors were asked to evaluate them on the following scale:
· 1 (bad) - when less than 30% of recommendations seem relevant to a user;
· 2 (normal) - a user thinks that from 30 to 70% of recommendations are suitable;
· 3 (good) - more than 70% of suggestions are believed to be useful.
The mean ratings given by each assessor to each source are shown in the tables.
Table 1. Assessor 1
Source 1 | Source 2 | Source 3 |
2.4 | 1.85 | 1.9 |
Table 2. Assessor 2
Source 1 | Source 2 | Source 3 |
1.95 | 2.05 | 2 |
According to the tables, the assessors' evaluations of source 3 are similar, but they do not agree on the performance of source 1 and source 2. In any case, in order to use the information about the assessors' opinion on source 3, the inter-rater agreement should be measured.
Calculating simple percent agreement does not always provide reliable results, as there is always the possibility that the raters agreed by chance. The Kappa coefficient compares the actual agreement to the agreement that could occur by chance.
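For reference, this comparison is usually written as the formula below. It is the standard definition of Cohen's Kappa; the notation is assumed here and does not come from the original text: N is the number of evaluated items, k is the number of categories, n_{ii} is the number of items both assessors put in category i, and n_{i.} and n_{.i} are the row and column totals for category i.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_o = \frac{1}{N}\sum_{i=1}^{k} n_{ii}, \qquad
p_e = \frac{1}{N^2}\sum_{i=1}^{k} n_{i\cdot}\, n_{\cdot i}
```

Here p_o is the observed share of agreement and p_e is the share of agreement expected by chance.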
The Kappa statistic can take values in the range from -1 to 1, where 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance. A popular interpretation scale for positive Kappa values has five grades, where the interval 0.81-0.99 denotes almost perfect agreement, and the interval 0.01-0.20 denotes slight agreement.
In evaluations with many categories, it is also possible to calculate weighted Kappa. As the Kappa coefficient is a popular measure in medical literature, let us consider a medical example. It could be that one radiologist judges a mammogram to be normal and another judges it to be benign, but it could also happen that the categories chosen by the two doctors are normal and cancer. Weighted Kappa is especially useful in such cases, as it gives less credit for agreement the further apart the chosen categories are on the scale. [Viera, Garrett 2005]
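A minimal sketch of weighted Kappa is given below. It assumes linear disagreement weights |i - j| / (k - 1), which is one common choice but is not stated in the source; the function name and the table layout (rows for one rater, columns for the other, categories in scale order) are also assumptions made for illustration.

```python
def weighted_kappa(table):
    """Linearly weighted Kappa for a square table of rating counts.

    table[i][j] is the number of items that rater A put in category i
    and rater B put in category j; categories are ordered on the scale.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the extremes
            observed += weight * table[i][j]
            expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - observed / expected
```

With only two categories every disagreement has weight 1, so the weighted value coincides with the ordinary Kappa coefficient.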
The summary tables for assessor 1 and assessor 2 are presented below.
Table 3.
Assessor 1 \ Assessor 2 | bad | normal | good | Total |
bad | 9 | 5 | 2 | 16 |
normal | 6 | 16 | 3 | 25 |
good | 2 | 5 | 12 | 19 |
Total | 17 | 26 | 17 | 60 |
Table 4.
 | bad | normal | good | Total |
Agreement | 9 | 16 | 12 | 37 |
By Chance | 4.53 | 10.83 | 5.38 | 20.74 |
The Kappa coefficient in this case equals 0.41. On the Kappa interpretation scale, this is considered moderate agreement (0.41-0.60).
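The calculation can be reproduced from the counts in Table 3 with a short sketch. The function name cohen_kappa and the list-of-lists layout are assumptions made for illustration; the counts themselves are taken from Table 3.

```python
def cohen_kappa(table):
    """Cohen's Kappa for a square table of rating counts.

    table[i][j] is the number of items rated category i by assessor 1
    and category j by assessor 2.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Items on the diagonal are those both assessors rated identically.
    agreement = sum(table[i][i] for i in range(k))
    # Agreement expected by chance, from the marginal totals.
    by_chance = sum(row_totals[i] * col_totals[i] / n for i in range(k))
    return (agreement - by_chance) / (n - by_chance)


# Rows: assessor 1 (bad, normal, good); columns: assessor 2.
TABLE_3 = [
    [9, 5, 2],
    [6, 16, 3],
    [2, 5, 12],
]

print(round(cohen_kappa(TABLE_3), 2))  # 0.41
```

The agreement of 37 and the chance agreement of about 20.75 correspond to the Total column of Table 4.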
All in all, since the assessors agreed on the evaluation of only one of the sources and the Kappa coefficient of agreement is relatively low, the evaluations give no basis for ordering the sources, so the recommendations could be produced in a random order. However, randomizing is not the best option, since the recommendations provided by one source can look rather similar - for example, “Evening wear -- 1920-1929” and “Evening wear -- 1910-1919”. Obviously, it is more convenient for a user when such topics occur close or next to each other.
For this reason, it is necessary to rank the sources without taking their evaluation into account. I suggest the following ranking:
1. Graph
2. VSM
3. WordNet
It is mostly based on my own experience with the system.
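One way such a fixed ranking might be applied is sketched below: the recommendations of each source are kept together and the sources are concatenated in rank order. The source keys, the function name and the limit parameter are hypothetical, and the sketch is not taken from the project code.

```python
# Source order from the ranking above (keys are hypothetical labels).
SOURCE_RANKING = ["graph", "vsm", "wordnet"]


def merge_recommendations(by_source, limit=None):
    """Concatenate per-source recommendations in rank order.

    by_source maps a source label to its list of recommended topics.
    A topic suggested by several sources is kept only at its first,
    highest-ranked position.
    """
    merged = []
    seen = set()
    for source in SOURCE_RANKING:
        for topic in by_source.get(source, []):
            if topic not in seen:
                seen.add(topic)
                merged.append(topic)
    return merged if limit is None else merged[:limit]
```

Because each source's list stays contiguous, similar topics from the same source remain next to each other in the final list.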
3. Results
3.1 Interfaces
To make a recommender system accessible to the general public, it should be provided with a convenient user interface. One of the ways to help a recommender find its users is to create a website.
The website for the NYPL topics recommender was created using the Flask package for Python together with HTML and CSS templates. Bootstrap, a free front-end framework, significantly simplified the development of the website, helped to achieve a user-friendly interface and automatically adapted the website to screens of various sizes, from laptops to mobile phones.
The website interface and all the code used in the project are currently available at https://github.com/AlinaBaranova/Recommender_system. The website is going to be published in the near future.
Adapting a website for mobile phones is very helpful; however, there is one more way to gain users. As messengers are becoming more popular and are now even used by more people than social networks, implementing a recommender system as a chatbot seems to be a really useful option. In contrast to creating a website, the programmer does not have to take care of the design, because messengers provide their own interface. As soon as the creator writes the algorithm for a chatbot, it is ready to be used. Such a chatbot was created for the popular messenger Telegram. It is named NYPL Topics Recommender.
3.2 Clusters
With the help of Gephi, a software package for graph exploration, the graph of topics was divided into 1014 clusters. The distribution of the elements across clusters can be seen in the following diagram:
Diagram 1.
There are 539 isolated nodes in the graph, each of them represented by its own cluster. For this reason, fewer than half of the clusters (at most 475 of 1014) contain two or more elements.
The main idea behind clustering was to combine topics with similar semantics into the same groups. One of the ways to measure semantic similarity between topics in a cluster is to use the VSM to calculate the cosine similarity between pairs of vectors representing these topics.
First, it was necessary to represent each topic with a vector. This was done by computing the average vector of all the words in a topic, excluding stop-words. After that, the cosine similarity between all the pairs of vectors in a cluster was calculated. The average similarity per cluster is illustrated by the barplot.
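The two steps described above can be sketched as follows. This is a simplified illustration in plain Python, not the project code: the stop-word list, the embeddings dictionary (word to vector) and all function names are assumptions, and words missing from the model are simply skipped.

```python
import math
from itertools import combinations

# Hypothetical stop-word list; the project's actual list is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "an"}


def topic_vector(topic, embeddings):
    """Average the vectors of a topic's words, skipping stop-words
    and words that are absent from the model."""
    words = [w for w in topic.lower().split()
             if w not in STOP_WORDS and w in embeddings]
    if not words:
        return None
    dim = len(embeddings[words[0]])
    return [sum(embeddings[w][d] for w in words) / len(words)
            for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_cluster_similarity(topics, embeddings):
    """Mean cosine similarity over all pairs of topic vectors in a cluster."""
    vectors = [v for v in (topic_vector(t, embeddings) for t in topics)
               if v is not None]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return None  # fewer than two representable topics
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A cluster with a single element has no pairs, so the sketch returns None for it rather than a similarity value.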
Diagram 2.
As can be seen from the diagram, the average similarity is not very high: most of the clusters fall into the 0.2 and 0.3 cosine similarity groups. I think this is a consequence of their sizes, as the previous diagram shows that many clusters contain more than 50 or even 100 elements.
Subsequently, it would be interesting to explore the clusters in more detail and find a way to associate them with available classifications. At the moment, they can be inspected in Gephi or a similar tool.
Conclusion
To summarize, a recommender system for the NYPL Digital Collections was designed. One of the sources of recommendations was a graph in which the topics were the nodes and the co-occurrences of topics were the edges. Each edge received a weight depending on the number of such co-occurrences. The graph recommends to the user the topics that are the closest neighbors of the topics matching the user's query.
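The graph source can be illustrated with a minimal sketch. Several points are assumptions made here and not confirmed by the text: the edge weight is taken to be the raw co-occurrence count, a topic matches the query by substring, and "closest" neighbors are those with the largest total edge weight. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_graph(items):
    """Build a weighted co-occurrence graph.

    items is an iterable of topic lists, one list per digital item.
    Returns {topic: {neighbor: weight}}, where weight is the number of
    items on which both topics occur (an assumed weighting).
    """
    weights = Counter()
    for topics in items:
        for a, b in combinations(sorted(set(topics)), 2):
            weights[(a, b)] += 1
    graph = {}
    for (a, b), w in weights.items():
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def recommend(graph, query, limit=10):
    """Recommend neighbors of the topics that contain the query string."""
    matches = {t for t in graph if query.lower() in t.lower()}
    scores = Counter()
    for topic in matches:
        for neighbor, w in graph[topic].items():
            if neighbor not in matches:
                scores[neighbor] += w
    # Highest weight first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, _ in ranked[:limit]]
```

Topics that never co-occur with another topic do not appear in this structure at all, which corresponds to the isolated nodes of the graph.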
Another strategy applied in the topics recommender is finding the words that are semantically closest to the query. Vector space models for the English language provide great opportunities for extracting words that are semantically similar to the user's query. The extracted words were then used for finding the relevant topics.
The last approach to finding recommendations, and the second way to expand the user's query, consisted of finding synonyms, antonyms, hyponyms and hypernyms via the lexical database WordNet. While going through the corresponding synsets, the matches between the words and topics were discovered. These topics were then recommended as possibly interesting for the user.
References
1. Jannach et al. 2011 - D. Jannach, M. Zanker, A. Felfernig, G. Friedrich. Recommender Systems: An Introduction. New York: Cambridge University Press, 2011.
2. Pantel, Turney 2010 - P. Pantel, P. D. Turney. From Frequency to Meaning: Vector Space Models of Semantics // Journal of Artificial Intelligence Research 37, 2010. P. 141-188.
3. Ricci et al. 2011 - F. Ricci, L. Rokach, B. Shapira, P. B. Kantor. Recommender Systems Handbook. New York: Springer, 2011.
4. Sletten 2015 - B. Sletten. Data integration at scale: Linked Data. Apply principles for connecting large, independent web data sets. June 22, 2015. https://www.ibm.com/developerworks/xml/library/wa-data-integration-at-scale_linked-data/index.html
5. Viera, Garrett 2005 - A. J. Viera, J. M. Garrett. Understanding Interobserver Agreement: The Kappa Statistic // Family Medicine 37(5), 2005. P. 360-363.