Improvement and Research of Social Networks Based on the PageRank Algorithm
2014-11-19
冷若冰 袁航
In the final design of PageRank, Brin and Page stipulate that websites transmit an “importance measure” through links. The importance measure of each website equals the sum of the importance measures that other websites transmit to it, so the measure flows throughout the whole network. From the in-link point of view, a website can obtain a high measure for two reasons: either many websites give measure to it, or few websites do but each of them gives a large amount. From the out-link point of view, once the measure of a website is fixed, the more out-links it has, the less measure each out-link carries.
Next we show how the “importance measure” is transmitted among websites. The top-left website has a measure of 100 and transmits 50 to each of the top-right and bottom-right websites through its 2 out-links. The bottom-left website has a measure of only 9. It splits this measure over 3 out-links, so each target receives 3. Only one of these out-links points to the top-right website; the other two targets are not shown in the figure. The top-right website therefore ends up with 53 and the bottom-right website with 50. Since both have two out-links, each out-link of the top-right website transmits more measure (26.5) than each out-link of the bottom-right website (25).
In-link of website i: a hyperlink pointing to website i from another website.
Out-link of website i: a hyperlink pointing from website i to another website.
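The arithmetic of this example can be checked with a short sketch. The helper name `split_measure` and the variable names are illustrative assumptions; only the numbers come from the example.

```python
def split_measure(measure, out_links):
    """Divide a website's measure evenly over its out-links."""
    return measure / out_links


# Figures taken from the example above.
top_left = 100     # has 2 out-links
bottom_left = 9    # has 3 out-links, one of which reaches the top-right website

# The top-right website receives one share from each source,
# the bottom-right website only a share from the top-left website.
top_right = split_measure(top_left, 2) + split_measure(bottom_left, 3)
bottom_right = split_measure(top_left, 2)

# Both receiving websites have 2 out-links of their own.
per_link_top_right = split_measure(top_right, 2)
per_link_bottom_right = split_measure(bottom_right, 2)

print(top_right, bottom_right)                    # 53.0 50.0
print(per_link_top_right, per_link_bottom_right)  # 26.5 25.0
```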
Define a directed network $G=(V,E)$, where $V$ is the set of nodes (all websites), $E$ is the set of directed edges (hyperlinks), and $n$ is the number of websites in the network. The PageRank value of website $i$, denoted $P(i)$, is defined as:
$$P(i)=\sum_{(j,i)\in E}\frac{P(j)}{O_j}$$
where $O_j$ is the number of out-links of website $j$. Mathematically, this gives $n$ linear equations in $n$ unknowns, which can be written in matrix form. Let $P$ be the $n$-dimensional column vector of all PageRank values:
$$P=(P(1),P(2),\ldots,P(n))^{T}$$
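Let $A$ be the adjacency matrix of the graph, with entries
$$A_{ij}=\begin{cases}\dfrac{1}{O_i} & \text{if } (i,j)\in E\\ 0 & \text{otherwise}\end{cases}$$
The system of equations can then be written as:
$$P=A^{T}P$$
It can be seen that $P$ is an eigenvector of $A^{T}$ corresponding to the eigenvalue 1.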
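As an illustration, the fixed point can be approximated by repeatedly redistributing each node's value over its out-links. This is a minimal sketch. It assumes a small hypothetical graph that is strongly connected and aperiodic and in which every node has out-links. The function name `pagerank_undamped` is not from the original.

```python
def pagerank_undamped(out_links, iterations=200):
    """Iterate P(i) = sum over in-links (j, i) of P(j) / O_j.

    out_links maps each node 0..n-1 to the list of nodes it links to.
    Assumes every node has at least one out-link and that the graph is
    strongly connected and aperiodic, so the iteration converges.
    """
    n = len(out_links)
    p = [1.0 / n] * n                      # start from the uniform vector
    for _ in range(iterations):
        new_p = [0.0] * n
        for j, targets in out_links.items():
            share = p[j] / len(targets)    # P(j) / O_j
            for i in targets:
                new_p[i] += share
        p = new_p
    return p


# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
graph = {0: [1, 2], 1: [2], 2: [0]}
p = pagerank_undamped(graph)
print([round(x, 3) for x in p])   # [0.4, 0.2, 0.4]
```

Solving the three equations by hand gives the same result: P(0) = P(2), P(1) = P(0)/2, and the values sum to 1.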
Solving this equation requires some conditions to hold. The matrix $A$ must be a stochastic matrix (every node has out-links, so each row sums to 1), and it must be irreducible (the directed graph corresponding to $A$ is strongly connected) and aperiodic. A real network (or social network) does not satisfy these conditions. In fact, the equations above can be derived from Markov chains, and $A^{T}$ needs some modification to satisfy the conditions. To make it irreducible, a damping factor $d$ is introduced: multiply $A^{T}$ by $d$ and add $(1-d)\,ee^{T}/n$, where $e$ is the $n$-dimensional vector of all ones. This means that with probability $(1-d)$ a random surfer jumps to a website chosen uniformly at random, so every website links to every other website with probability at least $(1-d)/n$ and a strongly connected graph is formed.
A modified PageRank model can be deduced:
$$P=\left((1-d)\frac{ee^{T}}{n}+dA^{T}\right)P$$
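The damped model can be sketched as follows. In each step, a node keeps a fraction d of the value flowing in through links and receives an equal share of the remaining (1 - d). The function name `pagerank_damped`, the value d = 0.85 and the toy graph are assumptions for illustration. The sketch still assumes that every node has at least one out-link.

```python
def pagerank_damped(out_links, d=0.85, iterations=200):
    """Iterate P = ((1 - d) * e e^T / n + d * A^T) P.

    out_links maps each node 0..n-1 to the list of nodes it links to.
    Assumes every node has at least one out-link (no dangling nodes).
    """
    n = len(out_links)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # (1 - d) * e e^T / n applied to p gives every node the same base value.
        base = (1.0 - d) * sum(p) / n
        new_p = [base] * n
        for j, targets in out_links.items():
            share = d * p[j] / len(targets)   # d * P(j) / O_j
            for i in targets:
                new_p[i] += share
        p = new_p
    return p


# Hypothetical graph that is NOT strongly connected: node 0 has no in-links.
graph = {0: [1], 1: [2], 2: [1]}
p = pagerank_damped(graph)
print([round(x, 4) for x in p])
```

Node 0 has no in-links, yet it keeps the base value (1 - d)/n = 0.05 instead of dropping to zero.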
If the initial matrix needs to be personalized, the uniform vector $e/n$ can be replaced by a probability vector $v$ (named the ‘personalization vector’). The matrix $G$ can then be deduced:
$$G=dA+(1-d)\,ev^{T}$$
The matrix $G$ is also called the ‘Google matrix’. With it, the formula above can be expressed as:
$$\pi^{T}=\pi^{T}G$$
The $\pi$ here is the same as the $P$ above, only transposed: $\pi$ is the vector of PageRank values. Let $e_i$ be the $i$-th column of the identity matrix; the PageRank value of node $i$ is then:
$$\pi_i=\pi^{T}e_i$$
Since different personalization vectors can be chosen, and different vectors $v$ yield different $\pi$, we write $\pi=\pi(v)$. In the simplest case, $v=e/n$.
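The dependence of the PageRank vector on the personalization vector can be illustrated with a small sketch. It builds the Google matrix explicitly as d·A + (1 - d)·e·vᵀ and iterates the row vector. The function names `google_matrix` and `stationary`, the 3-node matrix `A`, the value d = 0.85 and the biased vector are all assumptions chosen for illustration.

```python
def google_matrix(A, v, d=0.85):
    """Return G = d * A + (1 - d) * e v^T for a row-stochastic A."""
    n = len(A)
    return [[d * A[i][j] + (1.0 - d) * v[j] for j in range(n)]
            for i in range(n)]


def stationary(G, iterations=200):
    """Approximate pi with pi^T = pi^T G by repeated multiplication."""
    n = len(G)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical row-stochastic matrix: 0 -> {1, 2}, 1 -> 2, 2 -> 0
A = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

uniform = [1.0 / 3] * 3        # simplest case, v = e / n
biased = [0.0, 1.0, 0.0]       # all random jumps land on node 1

pi_uniform = stationary(google_matrix(A, uniform))
pi_biased = stationary(google_matrix(A, biased))

print([round(x, 4) for x in pi_uniform])
print([round(x, 4) for x in pi_biased])
```

Biasing v toward node 1 raises that node's PageRank value relative to the uniform case, which is how a personalization vector can encode user behavior.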
2 Definition of Community Tree
After the Community Tree has been obtained from the social network, the network's communities and their organizational structure can be deduced. Figure 2.6 is an example. Node 1 and node 5 are the cores of community 1 and community 2, respectively, and the immediate leader of node 1 and node 2.
The PageRank algorithm calculates a global value for every website by analyzing the links between websites; this value expresses the website's significance. The significance of every member of a social network can be evaluated in the same way, by calculating an m-Score value for every node with PageRank. In a network, random walks implicitly perform a soft clustering of the nodes, so they can be used to find the immediate leader of every member. A Community Tree can then be formed by combining random walks with the m-Score value of every node.
3 Detailed design
First, we obtain the one-step probability transition matrix of the social network G. Let t be the number of jumps of the random walk. After normalization, we obtain the t-step probability transition matrix M. Then we call calc_m-Score(G) to calculate the m-Score value of each node. For each node i, we use M to find the node j that node i is most likely to reach after t steps. If the PageRank value of node j is larger than that of node i, we consider node j the father node of node i.
Pseudo-code that calculates the improved CT Tree
Algorithm: revised_CT_Deriving
Input: social network G, number of jumps t
Output: the improved CT Tree
Procedure:
CT ← [null, …, null]
A ← getOneStepTransMatrix(G)
Z ← diagonal matrix satisfying Z_jj = ∑_i [A^t]_ij
M^t ← A^t · Z^(-1)
R ← calc_m-Score(G)
for each node i in R
    list ← node indices sorted by M^t[i] in descending order
    for each k in list
        if R[k] > R[i]
            CT[i] ← k
            break
    end
end
return CT
In the improved CT_Deriving, the father node of node i is not simply the node with the largest t-step transition probability. Instead, we sort the t-step transition probabilities from node i to all nodes in descending order and check the PageRank value of each node k in turn until we find a node k whose PageRank value is larger than that of node i. We then set node k as the father node of node i.
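The selection rule can be sketched in plain Python. This is an illustrative reading of the procedure, not the authors' implementation: the helper names `mat_mul`, `t_step_matrix` and `revised_ct_deriving` are assumptions, the one-step matrix and the scores are passed in directly instead of being computed by getOneStepTransMatrix and calc_m-Score, and the 4-node network is hypothetical.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def t_step_matrix(A, t):
    """Return A^t with every column rescaled to sum to 1 (M^t = A^t Z^-1)."""
    n = len(A)
    At = A
    for _ in range(t - 1):          # requires t >= 1
        At = mat_mul(At, A)
    col_sums = [sum(At[i][j] for i in range(n)) for j in range(n)]
    return [[At[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def revised_ct_deriving(A, scores, t):
    """Return ct, where ct[i] is the father of node i (None for a root)."""
    n = len(A)
    M = t_step_matrix(A, t)
    ct = [None] * n
    for i in range(n):
        # Candidate fathers, most likely t-step destination first.
        candidates = sorted(range(n), key=lambda k: M[i][k], reverse=True)
        for k in candidates:
            if scores[k] > scores[i]:   # first candidate with a higher score
                ct[i] = k
                break
    return ct


# Hypothetical network with undirected edges 0-1, 0-2, 1-2, 2-3.
A = [[0.0, 0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [1 / 3, 1 / 3, 0.0, 1 / 3],
     [0.0, 0.0, 1.0, 0.0]]
scores = [0.25, 0.25, 0.375, 0.125]   # stand-in for the m-Score values

tree = revised_ct_deriving(A, scores, t=1)
print(tree)   # [2, 2, None, 2]
```

For node 0 the most likely destination is node 1, whose score is not higher, so the search moves on to node 2. Node 2 has the highest score and therefore stays a root.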
4 Result
In Figure 3.10, the blue polyline shows the trend of the PageRank values without offset, the red polyline shows the trend with offset, and the green polyline shows the trend when p_2 has been offset. It can be clearly seen that the PageRank values of nodes {5,6,7,11} increase with the level of offset, while the other nodes show different degrees of decrease.
We can see that after the offset, nodes {5,6,7,11} enter the Candidate Set, and their action scope can also be clearly seen. Node 17 does not enter the Candidate Set; after the offset, its action scope changes from 3 nodes (13, 18, 22) to 4 nodes (12, 13, 18, 22). Node 1, however, remains in the Candidate Set: although it is not dominant in ‘interest’, its ‘influence’ cannot be ignored because it has many ‘friends’.
Figure 3.10
In this model, we improved the method for creating the CT Tree. Tests show that user behavior can be reflected through a customized personalization vector, which in turn affects the user's PageRank value. The PageRank value therefore contains information about both the link structure of the network and user behavior. Combined with the improved random walk algorithm, this lets us identify the “loose relationship” among users, which reflects that nodes may affect each other with a certain probability. We present the users' dependencies as a CT Tree in a figure and select a certain number of decision nodes in the CT Tree. The information publisher can then affect other nodes through reference nodes.