Characteristics of social network

The main tasks of applications developed for social network analysis. Analysis of approaches to software development. Selection of project management tools. Specifics of using the programming language and the integrated development environment.

Category: Foreign languages and linguistics
Type: diploma thesis
Language: English
Date added: 22.09.2016
File size: 1.1 M

Posted on http://www.allbest.ru/

NATIONAL RESEARCH UNIVERSITY

HIGHER SCHOOL of ECONOMICS

Faculty of Business and Management

School of Business Informatics

Author: Daniil Pronin

MASTER THESIS

Master's programme «Big Data Systems»

DEVELOPMENT OF SOFTWARE FOR MARKETING ANALYSIS AND INTEREST ANALYSIS IN VKONTAKTE SOCIAL NETWORK WITH APPLICATION OF LINK ANALYSIS

Moscow 2016

CONTENTS

INTRODUCTION

1. TASK DEFINITION

2. THEORETICAL BACKGROUND

2.1 Social Networks and Social Networking Services

2.2 Social Graph

2.3 Interest Graph

2.4 Related Works

2.5 Data definition

3. WORK'S SCIENTIFIC VALUE

4. METHODOLOGICAL VARIETY

4.1 Network measures

4.2 PageRank

4.3 SimRank

4.4 Binary coefficients

4.5 Software development approaches

5. TOOLS SELECTION

5.1 Programming Language and IDE

5.2 Packages and libraries

5.3 Project management tool selection

6. PRACTICAL IMPLEMENTATION DESCRIPTION

6.1 Requirements formulation

6.2 UML

6.3 Programming an application

7. PRACTICAL RESULTS DESCRIPTION

7.1 Application demonstration

7.2 Restrictions

7.3 Measures comparison

8. TASK COMPLETION PROOF

CONCLUSION

BIBLIOGRAPHY

ABSTRACT

APPENDIX

INTRODUCTION

The boom of online social networks at the beginning of the 21st century brought a new era of communication, in which people do not need to meet in person to talk. Today users of social networking services may have hundreds or even thousands of friends, followers and followings (depending on the type of social network) all over the world. The main types of social networking services are those that contain category places (such as former school year or classmates), means to connect with friends (usually with self-description pages), and a recommendation system linked to trust [1]. In most cases a social networking service (Facebook, Vkontakte, Twitter etc.) combines all of these types.

The emergence of social networking services not only gave people an opportunity to communicate, share content and unite into interest groups; it also gave government and business a huge source of information for analysis. Although social network theory itself emerged in the first half of the 20th century, it was hard to apply because data was limited and access to it was restricted. Today the technologies and data sources of social networking services allow data scientists to build recommendation systems, analyze community formation, predict trends and public sentiment, and conduct many other studies based on terabytes of data.

The major tasks of applications developed for social network analysis are:

· to monitor and analyze social networks;

· to forecast and control social networks.

Monitoring includes raw data retrieval and structuring. User profiles, connections, posts and messages are accumulated. The capabilities of the system are largely defined by the variety of the data used and by the processing mode. Systems that support real-time monitoring are harder to implement than systems that use retrospective data collection.

Analysis consists of several processing steps. The first is the calculation of basic indicators, which answer simple quantitative questions such as “how many messages has user A posted”. Then the detection of statistical and structural patterns in the data gives an understanding of the social network type.

Forecasting is possible if a mathematical model has been identified.

Control is the application of targeted actions to a social network in order to move its information processes into a desired state.

The author has set the goal of developing a system for social network monitoring and analysis. This system is a recommendation system based on link analysis. The idea was to create a system that takes a user profile as an input parameter and performs link analysis and interest analysis of the input user and his/her friends, producing a ranking of interest similarity among the user's friends. The similarity ranking would be useful for targeted advertising in the Vkontakte social network, given the appearance of features such as direct conversation between a user and an organization (which has the type `group' in Vkontakte) and the opportunity for organizations to post goods and prices on the group's page.

The assumptions of this research are as follows:

· the more interests and friends the observed profile has in common with the input user, the higher its similarity rank;

· the interests stated in a profile are the actual interests of the user;

· the organization running the program has the Vkontakte id of a customer who has recently purchased the product or service;

· the target audience of the product or service is not restricted (it is not construction materials, specialized equipment etc.).

The remainder of this thesis is organized as follows:

Task Definition presents the major goal of the research and the tasks required to achieve it. Theoretical Background then gives a historical summary of the development of social network theory and social networking services. After the historical summary, the main concepts of social networks are described and related works are reviewed. The next part, Work's Scientific Value, contains arguments for the relevance and novelty of the thesis. Methodological Variety covers the existing ways of solving similar problems and their evaluation. Tools Selection introduces the tools considered by the author for achieving the goal of the thesis and justifies the choice. The next part, Practical Implementation Description, contains the description of the major algorithms, conceptual diagrams and extracts from the program code of the solution. The following part, Practical Results Description, introduces the final form of the application and demonstrates its functionality. Then the Practical Results Analysis section describes the quantitative characteristics of the application (processing time etc.), the qualitative analysis (manual comparison with the actual data of Vkontakte users etc.) and the limitations of the developed application. Task Completion Proof shows the completion of the tasks defined in the Task Definition section, with references to the sections of this thesis. Finally, the Conclusion summarizes the work undertaken, notes the areas of application, and outlines the limitations and possible future development of the application.

1. TASK DEFINITION

The major goal of this thesis is to develop software for marketing analysis and interest analysis in the Vkontakte social network using link analysis.

To achieve this goal, the following tasks should be completed:

· An overview of existing methods and approaches to link analysis

· A study of the possible scientific value of the work based on the current state of business in this area

· Development of general algorithms

· Selection of software tools

· Implementation of the algorithms using existing tools

· Solution testing and demonstration of practical applicability

2. THEORETICAL BACKGROUND

2.1 Social Networks and Social Networking Services

The theory of social networks was founded by Ray Solomonoff and Anatol Rapoport in 1951. Eight years later Paul Erdős and Alfréd Rényi continued to develop this theory in articles published in 1959-1968 [2]. These articles presented the principles of social network formation. The term `social network' itself was coined by John A. Barnes, a sociologist of the `Manchester School'.

In 1998 Duncan J. Watts and Steven H. Strogatz introduced a mathematical theory of social network development and the concept of the clustering coefficient, which measures the level of propinquity between inhomogeneous groups [3].

Thus, by the end of the 20th century the core body of sociological and mathematical research had been formed. This research has become the scientific foundation of statistics and social network analysis.

One of the first social networking services was developed in 1971 by the military on the ARPANET [4].

In 1988 IRC (Internet Relay Chat), a protocol for real-time communication, was introduced by the Finnish student Jarkko Oikarinen. IRC connected users and gave them an opportunity to communicate in real time. Later Khaled Mardam-Bey, a British programmer of Palestinian and Syrian origin, developed mIRC, an IRC client for the Microsoft Windows OS. mIRC has been downloaded more than 40 million times, and in 2003 it was included in the top 10 most popular Internet applications rating compiled by Nielsen/NetRatings. Nowadays mIRC is used by the US military as a `tactical chat' for communicating practical information on the battlefield [5].

November 1996 gave the world the famous internet messenger ICQ [6]. It was developed by high school students from Tel Aviv.

The creation of the website Classmates.com [7] was an important event in the development of social networking services. This website cannot be fully considered a social networking service in the current sense, as there was no opportunity to create a profile or to add friends. Classmates.com provided information about educational institutions and their students. The project expanded rapidly and still exists.

In 1997 SixDegrees.com [8], the first social networking service of the modern type, was launched. It took its name from the six degrees of separation concept [9] (the theory that a `friend of a friend' chain of at most six links is enough to connect any two people in the world). The project allowed users to create their own profile page, add friends and search for friends. It operated until 2001. According to Google blog, the founder of SixDegrees.com attributed the project's closure to its being ahead of its time (in 2000 fewer than half of US citizens had internet access, so the number of friends was too small to make communication on the website interesting).

In 1997-1999 several other social networking services similar to SixDegrees were developed (AsianAvenue, PlanetAll, Cyworld etc.).

On 18 March 1999 LiveJournal was introduced by the American student and programmer Brad Fitzpatrick. It was the first social networking service that allowed users to create communities and hold chat conversations in them [10].

In 2001 Ryze, a resource for finding business contacts, emerged [11]. In fact, it gave rise to the development of the currently popular LinkedIn. Then in 2002 Friendster was developed by Jonathan Abrams. It was the first network designed for finding friends and acquaintances among the lists of one's own friends instead of suggesting strangers as potential friends, as ordinary dating websites do. This innovation made Friendster rather popular during the first months of its existence. In 2002 the previously mentioned LinkedIn was developed. It was launched in May 2003 and took the lead among business social networking services.

Myspace was introduced in 2003 as well [12]. It won many users thanks to personal profile creation, convenient appearance settings, interest communities, photo, audio and video hosting, and a personal blog page. All these features made Myspace the most popular social networking service in 2006. It was favoured by famous rock bands: many musicians started to use Myspace for self-presentation, while their fans used it to communicate with their idols.

The year 2004 was marked by the emergence of a number of social networking services [13]:

· aSmallWorld - a private social network that requires an invitation

· Facebook - the world-famous social network developed by Mark Zuckerberg and his team

· Piczo - a social networking and blogging website for teens

· Dogster - a social networking service for dog lovers, with a cat-oriented sibling called Catster

· Mixi - a Japanese social networking site based on communities and common interests

· Multiply - a file-sharing social network with the ability to upload media (photo, video, music), recipes and calendars, and to blog

· Dodgeball - a location-based social networking service

· Flickr - a social website for image and video hosting

· Grono.net - a Polish social networking service

In 2006 Jack Dorsey launched Twitter [14], which became a fast-growing social networking service.

Finally, history comes to the creation of the two major Russian social networking services, Vkontakte [15] and Odnoklassniki [16], in 2006.

Vkontakte was first launched for beta testing in September 2006 by Pavel Durov. In the beginning user registration was limited and required an invitation. In January 2007 Vkontakte was incorporated as an LLC, and in December 2008 it became the leader of the Russian social networking market.

Nowadays Vkontakte is an open social networking service that allows users to create profiles and to upload and share photos, videos and audio. Users may create interest groups and events, make posts and send instant messages.

2.2 Social Graph

A social graph represents the system “user + surroundings + connections + interests” in a social network [17]. Thus, every user has as many social graphs as the number of social networks in which the user is active.

The term was popularized at the Facebook F8 conference on May 24, 2007, when it was used to explain how the newly introduced Facebook Platform would take advantage of the relationships between individuals to offer a richer online experience [18]. A social graph characterizes a person, his/her interests and relationships. Social graphs of users are especially helpful for marketers, as they give information about the user in the role of a consumer (interests, subscribed groups etc.). Moreover, social graphs give an opportunity to expand the number of potential clients through the user's friends in the social network. For instance, exploring the social graph of a user gives an understanding of which products he/she may be interested in. After exploring the user's links (friends, followers, relatives etc.), it can be inferred how promising the user is from the perspective of attracting additional customers.

Metrics are the numerical characteristics of social objects, groups of objects and their connections. These metrics are used for social network analysis.

The metrics are divided into three types:

1. Relation metrics represent the pattern of relationships between one social object and others. These metrics include:

a. Homophily - the degree to which a user forms relationships with similar users. Similarity may be defined through gender, age, social status, educational level etc. [19]

b. Multiplexity - the number of `multiplex' links that users have. For example, two users who are friends and also work together have a multiplexity of 2.

c. Mutuality/Reciprocity - the degree to which users interact with and respond to each other.

d. Network closure - the degree to which a user's friends are friends with each other. [20]

e. Propinquity - the tendency of users to have more links with users who are geographically closer.

2. Link metrics represent the features of links for separate social objects and for the whole social graph. Link metrics include:

a. Bridge - a user who provides the only connection between two other users. A path between those two users that passes through the bridge is the shortest path.

b. Centrality - the degree that represents the importance and influence of a separate user (or cluster of users) inside the graph. [21]

c. Density - the proportion of direct links in the network relative to the total number of possible links. [22]

d. Distance - the minimum number of links required to connect two separate users.

e. Structural holes - the absence of a connection between two parts of a social network.

f. Tie strength - defined by a linear combination of time, closure and mutuality. Homophily, propinquity and transitivity indicate strong ties, while bridges indicate weak ties.

3. Segmentation metrics describe the characteristics of a social graph divided into distinctive segments:

a. Cliques - groups in which all users have direct links (all nodes of the group are interconnected) [23]

b. Social circles - groups in which direct links are not required [24]

c. Clustering coefficient - the likelihood that two different users connected with a specific individual are also connected with each other. The higher the clustering coefficient, the higher the group closure.

d. Cohesion - the degree to which users are interconnected by common links, forming social cohesion. Structural cohesion is characterized by the number of users whose removal would break the group apart.
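Three of the metrics listed above (density, distance and the clustering coefficient) can be illustrated with a minimal Python sketch. The friendship graph and the function names below are hypothetical and are not taken from the thesis software; the graph is assumed to be undirected and stored as adjacency sets.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected friendship graph: user -> set of friends.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}


def density(g):
    """Share of existing links among all possible links between users."""
    n = len(g)
    links = sum(len(friends) for friends in g.values()) // 2
    return links / (n * (n - 1) / 2)


def distance(g, source, target):
    """Minimum number of links between two users (breadth-first search)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        user, hops = queue.popleft()
        if user == target:
            return hops
        for friend in g[user] - seen:
            seen.add(friend)
            queue.append((friend, hops + 1))
    return None  # the users are not connected


def clustering_coefficient(g, user):
    """Share of pairs of the user's friends that are friends themselves."""
    friends = g[user]
    k = len(friends)
    if k < 2:
        return 0.0
    closed = sum(1 for u, v in combinations(friends, 2) if v in g[u])
    return closed / (k * (k - 1) / 2)


print(density(graph))
print(distance(graph, "B", "E"))
print(clustering_coefficient(graph, "A"))
```

In this toy graph user D is the only connection between E and the rest of the network, which also illustrates the notion of a bridge.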

2.3 Interest Graph

An interest graph is an online representation of a specific user's interests based on his/her social network activity. Usually the nodes of the interest graph are a person's interests; nevertheless, the profile of another person may also be a node of the graph. The edges of the graph connect persons and interests and represent the relations between nodes. An interest graph may help to understand a person's intentions: what the person wishes to buy, what places he/she wants to visit, what other user profiles he/she is interested in, and even whom he/she is ready to vote for.

The definition of an interest graph is simple.

Let the interest graph be represented by

G := {V, E},

where V is the set of nodes of the graph, each of which is either

· an interest

or

· a person,

and E is the set of edges, which represent the existence of a link between nodes.

For instance, Pic. 1 represents the interest graph G1 := {V, E}, in which V = {Ivan, Alex, Ice Hockey, Acoustic Guitar, Sport Cars, Soccer} and E = {e1, e2, e3, e4, e5, e6}.
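The definition can be made concrete with a small Python sketch of the graph G1. The node set is taken from the text, but the text does not say which nodes the six edges connect, so the edges below are purely illustrative assumptions, as are the helper function names.

```python
# Node set V of G1, split into persons and interests (from the text).
persons = {"Ivan", "Alex"}
interests = {"Ice Hockey", "Acoustic Guitar", "Sport Cars", "Soccer"}
V = persons | interests

# Edge set E: six undirected links. ASSUMPTION: the concrete endpoints
# are invented for illustration; the source only names e1..e6.
E = {
    frozenset({"Ivan", "Alex"}),             # person-person
    frozenset({"Ivan", "Ice Hockey"}),       # person-interest
    frozenset({"Ivan", "Acoustic Guitar"}),
    frozenset({"Alex", "Ice Hockey"}),
    frozenset({"Alex", "Sport Cars"}),
    frozenset({"Alex", "Soccer"}),
}


def neighbours(node):
    """All nodes that share an edge with the given node."""
    return {other for edge in E if node in edge for other in edge if other != node}


def common_interests(a, b):
    """Interests linked to both persons."""
    return neighbours(a) & neighbours(b) & interests


print(common_interests("Ivan", "Alex"))
```

Storing edges as unordered pairs keeps the graph undirected; a directed interest graph would use ordered tuples instead.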

An interest graph may have various link types, which allow it to go beyond ordinary social networks. For example, if a person needs to find an answer on a topic he/she is interested in and the person's friends cannot give one, he/she can use one of the following types of links [25]:

· Person-person (users in a social network may interact directly)

· Person-interest (the thing with which a user interacts in a social network)

· Interest-interest (similar interests may be connected)

For specific purposes, e.g. for organizing a content delivery network, a directed interest graph may be used, in which an edge from A to B means that A is interested in receiving content from B.

Picture 1.

An interest graph may also be represented as a weighted graph. In this case edge weights express the strength of the interconnection between nodes. To construct such a graph, it is initially assumed that all interconnections have equal strength: for example, the relation between an interest in hockey and an interest in music is not known, so the weight of the interconnection is set to infinity. Then, if it is observed that people interested in hockey behave similarly to people interested in music, the weight of the edge between the nodes representing these interests is reduced.

There are several ways to use an interest graph. Combined with the social graph, it may be used to establish connections between users. Moreover, it may be used to improve the various recommendation systems embedded in modern social networks (friend recommendation, music recommendation, community recommendation etc.).

From a business perspective it may be widely used in marketing analysis to analyze a project's audience and to promote the project further [26]. It is also effective for sentiment analysis and for advertising targeted at a user based on his/her interests [27]. For instance, using the interest graph, Twitter is able to aim advertising at a particular user based on his/her interests [28]. Moreover, the interest graph may be used for product creation based on consumers' needs; it helps to determine the features and capabilities that should be provided in the next versions.

2.4 Related Works

Running products

Currently Russian developers offer a number of products for social network mining and analysis. Generally these products analyze Vkontakte communities or public pages.

Popsters (http://popsters.ru)

A tool for analyzing communities and public pages, which computes the degree of community audience engagement. `Popsters' can sort posts (e.g. by the number of likes and reposts). The major social networks used in Russia are supported (Vkontakte, Facebook and Odnoklassniki).

SocialStats.ru (http://socialstats.ru/)

A free web application for Vkontakte, which offers detailed statistics on photo and video albums, wall posts and friend lists. It is used for analyzing user pages and communities.

JagaJam (http://www.jagajam.com/ru)

A service for analyzing communities and groups. It provides the user with in-depth statistics and can analyze engagement, content quality and other indicators. There is also a tool for comparing communities and finding audience intersections.

Allsocial.ru (http://allsocial.ru/communities)

A service for analyzing public pages. It gives information about reach, visitors, audience growth and advertising prices.

Reposts tree (http://dcpu.ru/vk_repost_tree.php)

A service for repost analysis. It may be useful for marketers who monitor the effectiveness of viral marketing campaigns.

Postee.ru (http://postee.ru)

This service allows exploring and analyzing a community's audience. Postee.ru has a sorting system that shows the most popular posts of any community. Moreover, it has a powerful charting tool, able to build a chart of the likes, reposts and comments made by the members of the community.

CleverPub (https://cleverpub.ru)

CleverPub is a paid tool for community administration, in which a community administrator can view group statistics and monitor advertising quality and autoposting.

Talk to Friends App (http://vk.com/talk_to_friends)

A counter application that identifies the user who got the highest number of likes and reposts after reposting the initial post.

Media-vk.ru (http://media-vk.ru)

A service for group and event analysis. Its distinctive feature is that media-vk.ru can build typical portraits of group/event members, build charts based on audience data analysis and show the top 30 other public pages to which the observed members are subscribed.

Research

Most current academic works are based on Facebook analysis, as Facebook is the world's largest and most famous social network.

Thus, J. Ugander et al. [29] carried out a complete social graph analysis, studying the social graph of active Facebook users. Their work included a global characterization of the graph, which showed a high degree of graph connectivity and confirmed the `six degrees' phenomenon [9]. The authors also showed the dense structure of the graph neighborhoods of users and characterized the assortativity patterns present in the graph. This resource-demanding research was carried out on a Hadoop cluster with 2,250 machines, using the Hadoop/Hive data analysis framework developed at Facebook.

Xiao Han et al. [30] proposed CSD (Community Similarity Degree), relying on the principle that the higher the CSD of a community, the more effective the community is for recommendation. CSD is the degree of interest similarity in a community. The authors offered some intuitions about the new metric: CSD depends on the number of common distinct interests among community members and on the absence of common interests among distinct members. They discovered that interest-based communities (communities in which all users have at least one common interest) tend to have a higher CSD than friend-based or location-based communities.

Spertus et al. [31] used a collaborative filtering approach on a per-community basis, which means that the more users two communities have in common, the more similar they are (overlapping membership).

Kim et al. [32] used the TF-IDF algorithm to calculate the term frequencies of nouns and counted the number of `Likes' in Facebook. The authors used the Facebook Open API to collect information from Facebook. They removed stopwords from posts and reduced words to their base form using the Python Natural Language Toolkit. After that they counted the term frequency of each noun, treating the noun as an interest, and weighted the frequencies with the formula:

(2)

As a result, a weighted ranking of user interests was obtained.

Hong et al. [33] developed MyMovieHistory, a movie recommendation system based on social affinities among users. They used the watched-movie history of each user, extracted the features of the films (leading actor, director, genre etc.) and weighted them using TF-IDF. Then they used the Jaccard and Overlap coefficients to explore movie similarities and computed the affinity between users based on the similarity of the movies they had watched in the past.
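The two set-based coefficients mentioned here have standard definitions: the Jaccard coefficient divides the size of the intersection by the size of the union, and the Overlap coefficient divides it by the size of the smaller set. The following Python sketch applies them to two hypothetical movie feature sets; it illustrates the coefficients only and does not reproduce the TF-IDF weighting or any other detail of the system in [33].

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|); 0.0 when either set is empty."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical feature sets (genre, leading actor, director) of two movies.
movie_x = {"sci-fi", "thriller", "actor_1", "director_1"}
movie_y = {"sci-fi", "actor_1", "director_2"}

print(jaccard(movie_x, movie_y))   # 2 shared features out of 5 distinct ones
print(overlap(movie_x, movie_y))   # 2 shared features out of 3 in the smaller set
```

The Overlap coefficient is never smaller than the Jaccard coefficient for the same pair of sets, so it is more forgiving when one set is much smaller than the other.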

Islam et al. [34] proposed a method for the product selection problem, called k-MPP (k-Most Promising Products). The data used in their research was divided into three groups: P, the set of all products on the market; C, the set of products preferred by customers; and Q, the set of products that the company can offer. They designed an index structure for dataset partitioning and built dynamic skylines [35] and reverse skylines [36] for each partition. The output of their algorithm was the subset of products from Q that are worth promoting to customers.

Recommendation systems

Recommendation systems are extremely promising for marketing in online social networks (OSNs): they provide users with suggestions such as what products to purchase, what movies to watch or what books to read [37]. Many works propose various approaches (e.g., a hierarchical Bayesian model [38], a trust circle-based model [39], a semantic similarity-based model [40]) to provide personalized recommendations to users. Such recommendation systems are normally classified into three categories according to the way they recommend: content-based, collaborative and hybrid approaches [41]. Most of these systems concentrate on recommendations for an individual user [42].

Similarity problem

Evaluating similarity is a practical and fundamental problem with a long history, which arises in various research domains such as geographic information science [43], biology [44] and decision-making [45]. In OSNs, a series of classical metrics, including overlap, cosine similarity, Jaccard similarity, the Pearson correlation coefficient, etc., are employed to estimate the strength of user relationships, the similarity of users' tastes/interests, and the resemblance of users' backgrounds [46]. To recommend social events while taking a user's home location into account, location similarity is calculated by weighted cosine similarity based on the common events that users from both locations have attended [47]. Besides, Han et al. [48] study the similarity between two users by both common friends and common interests and show that friends generally share more interests than strangers. The Pearson correlation coefficient is rather popular in collaborative filtering recommendation systems, as it subtracts the average rating score from each rating and thereby eliminates individual subjective differences [49].

Semantic objects, such as comments, posts and answers to questions, are widespread in OSNs nowadays. Estimating two users' similarity by their semantic relatedness is a fundamental task, which can in turn support a great number of applications (e.g., recommendation systems, information retrieval, and link prediction) [50]. Accordingly, similarity metrics such as mutual information [51], Lin's descriptive similarity [52], and maximum information path [53] have been proposed to capture the structural information between semantic objects. Besides, a collection of global structural similarity metrics (e.g., Katz, PageRank) have been proposed to capture global topology information based on the network structure. These metrics are widely used to measure similarity in link prediction, trust estimation, and community detection. Backstrom & Leskovec [54] calculate the PageRank score to predict and recommend links in a supervised way. Kusumoto et al. [55] propose a fast and scalable algorithm to compute the top-k similar nodes for a given node in terms of the SimRank metric, while Tao et al. [56] design an efficient algorithm to select the k most similar pairs of nodes with the largest SimRank similarities among all possible pairs. Zhang et al. [57] use the idea of random paths to quickly select the top-k similar nodes for a given node in a huge network and apply this method in two applications: identity resolution and structural hole spanner finding.
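The remark above that the Pearson correlation coefficient subtracts each user's average rating can be shown with a short Python sketch. It uses the standard identity that the Pearson coefficient equals the cosine similarity of the mean-centred vectors; the two rating vectors are invented for illustration and do not come from any cited work.

```python
from math import sqrt


def cosine(x, y):
    """Cosine similarity of two equally long rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation: cosine similarity after mean-centring."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical ratings of the same four items by two users: the second
# user has the same taste but rates everything two points higher.
strict_user = [1, 2, 3, 2]
generous_user = [3, 4, 5, 4]

print(cosine(strict_user, generous_user))   # below 1: the offset lowers it
print(pearson(strict_user, generous_user))  # close to 1: the offset is removed
```

Because centring removes the constant offset between the two users, the Pearson coefficient treats them as having identical preferences, while the plain cosine similarity does not.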

Vkontakte Analysis

Currently few works devoted to Vkontakte data mining and link analysis are available in the scientific community. Most of such works are inspired by Facebook analysis [58]. For instance, Wolfram Alpha has created a personal analytics service for Facebook [59]. This service gives an overview of the statistical information about a personal user page. The title page is presented in Pic. 2.

Inspired by the work done with Wolfram Mathematica on Facebook, V. Glagolev developed a Vkontakte link analysis application [60]. Glagolev used the VK API to interact with Vkontakte. L. Tonkih repeated Glagolev's research using Python 3.4 and the d3 library [61]. Both applications followed a similar algorithm:

· Creation and authorization of a VK Standalone Application

· Data retrieval

· Social graph visualization

As a result, both authors obtained similar graphs (Pic. 3 a, b).

Picture 3a. Wolfram Alpha. Picture 3b. Python and d3.
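The data retrieval step of such applications can be sketched in Python as follows. This is not the code of either cited application: the helper names, the placeholder token and the API version string are assumptions, and the sketch only builds the request URL for the friends.get method and turns a list of friend ids into graph edges, without sending anything over the network.

```python
from urllib.parse import parse_qs, urlencode, urlparse

API_ROOT = "https://api.vk.com/method/"


def build_request(method, access_token, version="5.131", **params):
    """Build the URL of a VK API call.

    ASSUMPTIONS: the version string is only an example, and the token is
    expected to come from the authorized standalone application.
    """
    query = dict(params, access_token=access_token, v=version)
    return API_ROOT + method + "?" + urlencode(query)


def edges_from_friends(user_id, friend_ids):
    """Turn a retrieved friend list into edges of the social graph."""
    return [(user_id, friend_id) for friend_id in friend_ids]


# Request for the friends of a hypothetical user with id 1.
url = build_request("friends.get", "TOKEN_PLACEHOLDER",
                    user_id=1, fields="sex,bdate,city")
print(parse_qs(urlparse(url).query))

# Friend ids as they might be returned by the call (invented values).
print(edges_from_friends(1, [2, 3, 5]))
```

Repeating the friends.get request for every friend and merging the resulting edge lists gives the graph that is then passed to the visualization step.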

2.5 Data definition

As the present thesis aims to analyze data from the Vkontakte social network, the major data structures of Vkontakte should be presented [62].

1. Object `user' - the main object used in the research. The object `user' has the following fields (Table 1 lists only the fields potentially needed for analysis, similarity measurement and visualization).

Table 1

| Field name | Format | Description |
|---|---|---|
| Common fields (used in methods users.getSubscriptions, groups.getBanned) | | |
| id | Positive integer | User identifier |
| first_name | String | User first name |
| last_name | String | User last name |
| deactivated | Values: deleted/banned | Returned if the user page is blocked or deleted. Extra fields are not returned. |
| hidden | Value: 1 | Returned when the object is requested without an access token and the user has set the parameter “Who can see my page in the internet” to “Only Vkontakte users”. If hidden is returned, extra fields are not returned. |
| Optional fields, set by parameter `fields' (used in methods users.get, friends.get, users.search, users.getFollowers, friends.getByPhones, friends.getSuggestions) | | |
| photo_id | user_id+'_'+photo_id | ID of the main profile photo (the photo that represents the user's page and serves as the user's avatar) |
| verified | Binary | 1 if the page is verified, 0 if it is not |
| sex | Values: 0, 1, 2 | Sex of the user, where 1 = female, 2 = male, 0 = undefined |
| bdate | Date (DD.MM.YYYY or DD.MM) | Date of the user's birth. The format depends on whether the birth year is available. |
| city | Object | City stated in the `Contacts' section of the user's page |
| city.id | Integer | City identification number |
| city.title | String | City title |
| country | Object | Country stated in the `Contacts' section of the user's page |
| country.id | Integer | Country identification number |
| country.title | String | Country title |
| universities | Array | A list of universities the user has attended |
| university | Object | Element of the universities array |
| university.id | Positive integer | University identifier |
| university.country | Positive integer | Identifier of the country where the university is situated |
| university.city | Positive integer | Identifier of the city where the university is situated |
| university.name | String | University name |
| university.faculty | Positive integer | Faculty identifier |
| university.faculty_name | String | Faculty name |
| university.chair | Positive integer | Chair identifier |
| university.chair_name | String | Chair name |
| university.graduation | Positive integer | Year of graduation |
| schools | Array | A list of schools the user has attended |
| school | Object | Element of the schools array |
| school.id | Positive integer | School identifier |
| school.country | | |
| school.city | | |
| school.name | | |
| school.year_graduated | Positive integer | Year when the user graduated from the school |
| school.class | String | Class letter |
| common_count | Positive integer | Number of friends in common with the current user |
| occupation | Object | Information about the user's current occupation |
| occupation.id | Positive integer | School/work/university identifier |
| occupation.name | String | Organization name |
| personal | Object | A set of fields from the “Life position” section |
| personal.political | Positive integer | Political preferences |
| personal.religion | String | Religion |
| personal.people_main | Positive integer | What is most important in people |
| personal.life_main | Positive integer | What is most important in life |
| personal.smoking | Positive integer | Attitude to smoking |
| personal.alcohol | Positive integer | Attitude to alcohol |
| activities | String | Activities |
| interests | String | User's interests |
| music | String | Music the user likes |
| movies | String | Favorite movies |
| tv | String | Favorite TV shows |
| books | String | Favorite books |
| games | String | Favorite games |

Nested fields are written as parent.field, for example city.id is the field id of the object city.
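To illustrate how the fields of Table 1 could be handled in code, the following minimal sketch converts one raw `user' record (a dictionary decoded from JSON) into a typed structure. The class `User`, the function `parse_user` and the sample records are hypothetical names introduced here for illustration only; they are not part of the Vkontakte API or of the thesis application. The sketch assumes the behaviour described in the table: extra fields are absent for deactivated or hidden pages, and `bdate` may come without a year.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    """A subset of the `user` fields from Table 1 (hypothetical container)."""
    id: int
    first_name: str
    last_name: str
    deactivated: Optional[str] = None   # "deleted" or "banned"
    sex: int = 0                        # 1 = female, 2 = male, 0 = undefined
    birth_day: Optional[int] = None
    birth_month: Optional[int] = None
    birth_year: Optional[int] = None    # stays None when bdate is DD.MM
    city_title: Optional[str] = None


def parse_user(raw: dict) -> User:
    """Build a User from one decoded JSON record (illustrative helper)."""
    user = User(
        id=raw["id"],
        first_name=raw["first_name"],
        last_name=raw["last_name"],
        deactivated=raw.get("deactivated"),
    )
    # For deactivated or hidden pages no extra fields are returned,
    # so any optional data is ignored.
    if user.deactivated or raw.get("hidden"):
        return user

    user.sex = raw.get("sex", 0)

    bdate = raw.get("bdate")
    if bdate:
        parts = [int(p) for p in bdate.split(".")]
        user.birth_day, user.birth_month = parts[0], parts[1]
        if len(parts) == 3:
            user.birth_year = parts[2]

    city = raw.get("city")
    if city:
        user.city_title = city.get("title")
    return user


# Invented sample records, used only to show the call.
active = parse_user({"id": 1, "first_name": "Ivan", "last_name": "Ivanov",
                     "sex": 2, "bdate": "10.10",
                     "city": {"id": 1, "title": "Moscow"}})
banned = parse_user({"id": 2, "first_name": "Anna", "last_name": "Petrova",
                     "deactivated": "banned"})
```

Keeping the birth date as separate day, month and year values avoids inventing a year for users who hide it.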

3. WORK'S SCIENTIFIC VALUE

Although the Vkontakte social network was launched just two years after Facebook and is currently the most popular social network in Russia [63], there is almost no published research on Vkontakte analysis. A number of companies currently provide complete social network analysis covering Facebook, Vkontakte and Twitter; nevertheless, the algorithms implemented by these companies are not disclosed to the public. The novelty of this thesis lies in the implementation of the author's own proposed algorithm for Vkontakte analysis.

The relevance of this thesis is that, due to budget limitations, all the algorithms are implemented with free software, so anyone with access to the source code or with experience in software development can reproduce the research and use it for private purposes. This may be helpful for small businesses operating or starting to operate in Vkontakte.

4. METHODOLOGICAL VARIETY

4.1 Network measures

The book Social Media Mining [64] describes the measures applicable to a social graph:

• Centrality

• Degree Centrality

• Eigenvector Centrality

• Local Clustering coefficient
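Two of the listed measures can be sketched in a few lines of plain Python. The example assumes an undirected friendship graph stored as a dictionary of neighbor sets; the function names and the toy graph are hypothetical. Degree centrality is normalized here by n - 1 (the maximum possible degree), which is one common convention rather than the only one. Eigenvector centrality is omitted for brevity.

```python
def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def local_clustering(adj, v):
    """Share of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to connect
    # Count each connected pair of neighbors once (a < b avoids duplicates).
    links = sum(1 for a in neighbors for b in neighbors
                if a < b and b in adj[a])
    return 2 * links / (k * (k - 1))


# Toy undirected graph: a is friends with b, c, d; b and c are also friends.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

dc = degree_centrality(adj)
cc = {v: local_clustering(adj, v) for v in adj}
```

In this toy graph node a is connected to everyone, so its degree centrality is 1, while only one of the three pairs of its friends is connected, giving a local clustering coefficient of 1/3.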

4.2 PageRank

PageRank [65] is Google's method for calculating web page importance. After the HTML title tag and keywords are considered, PageRank is used to rank sites according to their importance and return a ranked result to the user.

The underlying idea is that if page A links to page B, then page A considers page B important. PageRank also affects the weight of a page's outgoing links: if many important pages link to a page, then its links to other pages become more important.

Nowadays PageRank has lost some of its former significance because the meaning of a link has changed (originally a link was an explicit recommendation of a site, whereas now it is merely a connection to it).

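The PageRank formula is as follows:

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right),$$

where PR(A) is the PageRank of page A (the target measure),

d is the attenuation constant (damping factor), which is usually set to 0.85,

PR(T_i) is the PageRank of page T_i (a page that links to page A), i = 1, ..., n,

C(T_i) is the number of outgoing links from page T_i.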
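The calculation can be sketched as a simple fixed-point iteration. The sketch assumes the classic non-normalized form PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) with d = 0.85, a graph given as a dictionary mapping each page to the set of pages it links to, and at least one outgoing link on every page. The function name and the three-page graph are hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps every page to the set of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}                         # initial guess
    out_count = {p: len(links.get(p, ())) for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum the contributions of all pages t that link to a.
            incoming = sum(pr[t] / out_count[t]
                           for t in pages if a in links.get(t, ()))
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
pr = pagerank(links)
```

Page C receives links from both A and B and therefore ends up with the highest score, while B, which is referenced only by one of A's two links, gets the lowest.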

4.3 SimRank

SimRank was proposed by G. Jeh and J. Widom [66] as a similarity measure based on a graph-theoretic model.

The concept of SimRank is that two objects are similar if they are referenced by similar objects. Thus, the basic equation for SimRank is:

$$s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big),$$

where a, b are nodes, I(a), I(b) are the sets of in-neighbors of nodes a and b respectively, I_i(a) is the i-th in-neighbor of a, and C is a constant between 0 and 1. By definition s(a,a) = 1, and s(a,b) = 0 if I(a) or I(b) is empty.

A solution to the SimRank equations for a graph G can be reached by iteration to a fixed point. Let n be the number of nodes in G. For each iteration k, we can keep n^2 entries s_k(*,*), where s_k(a,b) gives the score between a and b on iteration k. s_{k+1}(*,*) is computed successively from s_k(*,*). The iteration starts with s_0(*,*), where each s_0(a,b) is a lower bound on the actual SimRank score s(a,b):

$$s_0(a,b) = \begin{cases} 0, & a \neq b, \\ 1, & a = b. \end{cases}$$

Then equation (4) is used to compute s_{k+1}(a,b) for a ≠ b, while s_{k+1}(a,b) = 1 for a = b:

$$s_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s_k\big(I_i(a), I_j(b)\big). \tag{4}$$

Thereby, the similarity of (a,b) is updated on each iteration k+1 using the similarity scores of the in-neighbors of a and b from iteration k.

In [60] the authors have shown that the values converge to limits satisfying the basic SimRank equation, that is:

$$\lim_{k \to \infty} s_k(a,b) = s(a,b).$$

Originally, the authors proposed C = 0.8 and k = 5 for the SimRank calculation. Nevertheless, research by members of the Russian Academy of Sciences [67] showed that these values do not give accurate results, and it is recommended to use C < 0.8 and more iterations.

Picture 4 shows a directed graph and the SimRank scores calculated for it.

Picture 4. Directed Graph and SimRank for C=0.5
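The iterative computation can be sketched in plain Python as follows. The sketch assumes the standard Jeh-Widom update: s_0(a,b) is 1 when a = b and 0 otherwise, and on every step s_{k+1}(a,b) is C times the average of s_k over all pairs of in-neighbors of a and b. The function name and the three-node graph are hypothetical and are not the graph of Picture 4.

```python
def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive SimRank iteration.

    `in_neighbors` maps every node to the list of nodes that point to it;
    every node of the graph must appear as a key.
    """
    nodes = list(in_neighbors)
    # s_0: a node is fully similar to itself and to nothing else.
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_s = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_s[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new_s[(a, b)] = 0.0   # no in-neighbors: similarity is 0
                    continue
                total = sum(s[(x, y)] for x in ia for y in ib)
                new_s[(a, b)] = C * total / (len(ia) * len(ib))
        s = new_s
    return s


# Hypothetical graph: node c points to both a and b.
in_neighbors = {"c": [], "a": ["c"], "b": ["c"]}
scores = simrank(in_neighbors, C=0.5, iterations=5)
```

Here a and b share their only in-neighbor c, so their score equals C * s(c,c) = 0.5, while c has no in-neighbors and is therefore not similar to any other node. The naive version keeps n^2 entries per iteration, which is acceptable only for small graphs.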

4.4 Binary coefficients

When the similarity between two objects is considered, such as two users in Vkontakte, binary similarity coefficients [68] can be used for similarity detection and ranking.

A similarity coefficient is a dimensionless metric that was initially used in biology to quantify the degree of similarity between biological objects. It belongs to the family of proximity measures, which also includes measures of diversity, concentration, inclusion, similarity, distinction (including distances), event compatibility, interdependence and independence. The theory of proximity measures is still being developed, so there are various approaches to formalizing proximity relations.

Proximity measures are widely used in biology, where different areas are compared (districts, phytocenoses, zoocenoses etc.). They are also used in geography, sociology, pattern recognition, search engines, comparative linguistics, bioinformatics, cheminformatics and other areas.

Most of the coefficients are normalized to the range from 0 (no similarity) to 1 (full similarity). Similarity and distinction complement each other (mathematically, Similarity = 1 - Distinction).

Taking the thesis topic into consideration, the following binary similarity coefficients may be used:

1. The Jaccard index, or Jaccard similarity coefficient, is used for comparing the similarity between two sets of objects. It is the first known similarity coefficient and was proposed in 1901. The Jaccard index is calculated with the following formula:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|},$$

where A is the first set, B is the second set, and |A| is the cardinality of set A.

Thus, if A={1,3,4,5,7} and B={1,2,3,4,5,8}, then |A|=5, |B|=6, |A∩B|=|{1,3,4,5}|=4, |A∪B|=|{1,2,3,4,5,7,8}|=7 and J(A,B)=4/7≈0.57.

2. The overlap coefficient, or Szymkiewicz-Simpson coefficient, is a similarity measure related to the Jaccard index that measures the overlap between two sets. It is defined as the size of the intersection divided by the size of the smaller of the two sets:

$$\mathrm{overlap}(A,B) = \frac{|A \cap B|}{\min(|A|, |B|)}.$$

3. The cosine coefficient, or Ochiai coefficient, is a binary similarity measure proposed by the Japanese biologist Akira Ochiai in 1957. The cosine coefficient is calculated as follows:

$$K(A,B) = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}}.$$
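The three coefficients can be sketched with Python's built-in sets. The sketch uses the standard set-based definitions: Jaccard = |A∩B| / |A∪B|, overlap = |A∩B| / min(|A|, |B|), cosine (Ochiai) = |A∩B| / sqrt(|A|·|B|). The function names are hypothetical, and returning 0.0 for empty input is a convention chosen here, not something stated in the thesis. In the thesis setting a set could hold, for example, the identifiers of the groups a user is subscribed to.

```python
from math import sqrt


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(a, b):
    """Ochiai coefficient: intersection over the geometric mean of the sizes."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# The sets from the Jaccard example in the text.
A = {1, 3, 4, 5, 7}
B = {1, 2, 3, 4, 5, 8}

j = jaccard(A, B)   # 4 / 7
o = overlap(A, B)   # 4 / 5
c = cosine(A, B)    # 4 / sqrt(30)
```

For these sets the Jaccard index is 4/7 ≈ 0.571, the overlap coefficient is 4/5 = 0.8 and the cosine coefficient is 4/√30 ≈ 0.730, which shows that the three measures can rank the same pair quite differently when the set sizes differ.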

4.5 Software development approaches

Waterfall

The waterfall model [69] is a sequential model of software development that passes successively through the phases of requirements analysis, design, implementation, testing, integration and support. It was first described in a 1970 article by W. W. Royce, who presented it as a concept and discussed its drawbacks. The original waterfall model includes six phases in the following order:

1. System and software requirements: captured in a product requirements document

2. Analysis: resulting in models, schema, and business rules

3. Design: resulting in the software architecture

4. Coding: the development, proving, and integration of software

5. Testing: the systematic discovery and debugging of defects

6. Operations: the installation, migration, support, and maintenance of complete systems

When following the waterfall model, a developer passes through all phases sequentially in strict order. The next phase cannot be started until the previous one has finished, and phases cannot overlap.

The obvious advantages of using the waterfall model are:

1. Complete documentation. If the team changes during project implementation, the new team has all the documents needed to continue development. Moreover, the support team does not depend on the development team.

2. The waterfall model provides a complete and easily understandable structure, which makes milestones easy to identify.

3. Following the phases sequentially makes it possible to reveal and fix problems in the project at early stages.

The disadvantages are:

1. Clients may not know exactly what their requirements are before they see working software and so change their requirements, leading to redesign, redevelopment, and retesting, and increased costs.

2. Designers may not be aware of future difficulties when designing a new software product or feature, in which case it is better to revise the design than persist in a design that does not account for any newly discovered constraints, requirements, or problems.

3. In response to the perceived problems with the pure waterfall model, modified waterfall models were introduced, such as "Sashimi (Waterfall with Overlapping Phases), Waterfall with Subprojects, and Waterfall with Risk Reduction".

4. Some organizations, such as the United States Department of Defense, now have a stated preference against waterfall type methodologies, starting with MIL-STD-498, which encourages evolutionary acquisition and Iterative and Incremental Development.

5. While advocates of agile software development argue the waterfall model is an ineffective process for developing software, some sceptics suggest that the waterfall model is a false argument used purely to market alternative development methodologies.

Agile

Agile is a family of approaches to software development oriented towards iterative development, dynamic formulation of requirements and their implementation through the constant interaction of self-organizing working groups consisting of different specialists.

Agile is used as an effective practice for organizing and managing the work of small groups.

One of the methodologies using the agile approach is Scrum.

Currently Scrum is one of the most popular software development methodologies. By definition, Scrum is a software development framework that allows a project team to address emerging issues efficiently and to deliver a product of the highest possible value to the client [70].

When people talk about the Scrum methodology, they usually mean an agile approach to software development based on the rules and practices of Scrum. Thus the effectiveness of Scrum varies from team to team, depending on how closely the Scrum Guide is followed and on the team's experience with the agile approach.

The authors of Scrum (Ken Schwaber and Jeff Sutherland) state the following features:

· Lightweight

· Understandable and accessible

· Hard to master

Scrum Roles:

· Product Owner

· Scrum Master

· Development team

The Product Owner (PO) is the point of contact between the development team and the client. The PO's task is to maximize the value of the product and of the team's work. The Product Owner maintains the Product Backlog, which contains the tasks to be completed by the development team (stories, bugs etc.), ordered by priority.

The Scrum Master is the team's servant leader. The Scrum Master's task is to help the team maximize its effectiveness by removing obstacles, helping with work organization, training and motivating the team, and helping the PO. The Scrum Master should moderate all working sessions and the Daily Scrum.

The development team consists of the specialists who develop the product directly. According to the Scrum Guide, the development team should possess the following qualities:

· Self-organizing

· Cross-functional (team members should have different technical skills and be able to help each other in different areas: development, testing etc.)

· Entirely responsible for product quality (responsibility lies with the whole team, not with individual members)

The development team usually consists of 7±2 members. According to the Scrum Guide, a larger team has more communication problems, while a smaller team risks lacking the skills and competencies required for product development and reduces the amount of work that can be done in one iteration.

Scrum Artefacts

· Product Backlog - a prioritized list of the tasks needed to develop and complete the product. The Product Backlog is dynamic and constantly updated. It is the only source of requirements for product features.

· Sprint Backlog - the set of tasks from the Product Backlog selected for implementation in the Sprint.

· Increment - the set of tasks completed during the Sprint. At the end of the Sprint the increment must be “Done”, that is, meet the team's definition of “Done”. Usually the increment is a set of features that can be delivered to the client at the end of the Sprint and that adds value to the operating product.

Scrum Events

· Sprint - an iteration of product development. The time-box of the Sprint is usually one to four weeks. While the Sprint is running:

o No changes may be made that would endanger Sprint completion (e.g. removing a team member, shortening the Sprint etc.)

o Quality goals must not be lowered

o The scope of the Sprint may be clarified and renegotiated between the Product Owner and the Development Team as more is learned.

· Sprint planning - the event at which the Sprint Backlog is filled with Product Backlog items. The team should fully understand which Backlog items together will deliver an increment.

· Daily Scrum - a daily meeting during the Sprint at which the team discusses the work done on the previous day, the activities planned for today and any issues that have emerged.

· Sprint review - the analysis of the product increment by the client and a check that it meets the definition of `Done'.

· Retrospective - an event for analyzing the team's weaknesses and for team improvement.

5. TOOLS SELECTION

5.1 Programming Language and IDE

When selecting the programming language and IDE for development, the author examined the two most popular open-source languages applicable to statistical and graph analysis: R and Python.

R [71] was created in 1995 by Ross Ihaka and Robert Gentleman and focuses on user-friendly data analysis, statistics and graphical models. Formerly R was used mainly in academia, but it has since gained wide popularity due to its understandability and the availability of a user-friendly IDE called RStudio.

Python [72] was released in 1991 by Guido van Rossum and focuses on productivity and code readability. Python has a number of IDEs; the most popular are Spyder and IPython Notebook.

Table 2

| Language | Python | R |
|---|---|---|
| Skills of author | No | Novice |
| Code understandability | Good | Average |
| Ease of learning | Easy to start working having experience in other languages | Needs to know basics of R |
| IDE features | + Allows to work in a file manager mode; + NoteBook for code testing; + Functions as Web app | + Console; + Autocomplete; + Multi-window interface; + History |
| Libraries for Graph analysis and visualization | figure API, networkx, igraph | giGraph, GrapheR, Bingat, Network, igraph |
| Web Application development | Flask and jinja library | Shiny R application |
| Integration with Vkontakte | Vk.com API | vkR |
| Issues with Vkontakte integration | - | Wrong encoding. Problem cannot be fixed by decoding |

Although R tends to be better suited to visualization than Python and the author has experience in R, the encoding issue during Vkontakte data retrieval played a key role in the selection of the programming language. As a result, Python was selected for the application implementation.

The following Python packages are expected to be used:

· igraph

· NetworkX

· Vk library

· Flask

5.2 Packages and libraries

igraph is a free open-source collection of network analysis tools for Python, R and C/C++ [74]. It allows drawing graphs and calculating PageRank, eigenvector centrality etc.

NetworkX is a Python software package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks [75]. The functionality of NetworkX is close to that of igraph, but it is needed because an extension of it provides a SimRank function.

Flask is a Python microframework for web application development. Its authors claim that it is intuitive and easy to use [76].

Vk library [77] is a package for interaction with the Vkontakte API.

D3.js [78] is a JavaScript library for data visualization.

5.3 Project management tool selection

Following the selection of the software development approach, the author needed a tool supporting the Scrum methodology. Budget constraints narrowed the range of project management tools. The two most promising tools were selected for comparison: taiga.io and StoriesOnBoard.

Taiga.io [79] (Picture 6) is a free and open-source project management tool for agile practitioners. It was released in 2014, and in 2015 it won the Most Valued Agile Tool nomination awarded by the Agile Portal [80].

Taiga provides the following functions:

• Team management

• Backlog fulfillment

• Stories estimation

• Sprint creation

• Story decomposition (tasks may be created for stories)

• Workflow

Picture 6. Taiga.io backlog page

StoriesOnBoard is a free user story mapping tool developed in 2014 by DevMads.Ltd.

The distinctive feature of StoriesOnBoard [80] is that it organizes the project Backlog into a story map laid out as a grid. The grouped functionalities are written horizontally, while the user stories that must be completed to implement each functionality are listed vertically beneath it. Once the grid is organized, it is partitioned into sprints: horizontal lines are drawn to mark the sprint borders. Picture 7 shows the view of the story map after sprint partitioning.

Picture 7. StoriesOnBoard interface.

Based on the review of the tools, the author decided to use taiga.io: its functionality is more traditional for Scrum support, and it provides workflows, which are informative during the sprint. StoriesOnBoard is better suited to representing the whole project scope (especially if the project is quite big).

...
