IsoGloVe: A New Count-based Graph Embedding Method based on Geodesic Distance

Graph embedding techniques have gained increasing attention for their ability to encode the complex structural information of networks into low-dimensional vectors. Existing graph embedding methods have achieved considerable success in various applications, but they have limitations in capturing global graph topology and fail to provide insights into the underlying mechanisms of network function. In this paper, we propose IsoGloVe, a count-based method that encodes graph topology into vectors using the co-occurrence statistics of fixed-size routes in random walks. IsoGloVe calculates the final embeddings from the geodesic distances of each node's neighbors on a manifold. This representation in geodesic space allows for the analysis of node interactions and contributes to a better understanding of complex network structure and function. The performance of IsoGloVe is evaluated on several protein-protein interaction (PPI) networks using graph reconstruction, node classification, and visualization. The findings reveal that IsoGloVe surpasses comparable methods, with a 30% increase in MAP for graph reconstruction and a 25% increase in model scores for node classification on the Yeast PPI network. In addition, IsoGloVe achieves a 6.9% increase in MAP for graph reconstruction on the Human PPI network.


Introduction
Graph embedding has been found to be effective in addressing challenges in Natural Language Processing (NLP) and Machine Learning (ML) by producing comprehensive representations of networks with varying densities [1]. One way to compute graph embeddings is to use NLP techniques, which learn rich vector representations of graphs as feature sets for downstream tasks. Several graph embedding methodologies have been presented in the literature; one of the most frequently employed is node2vec [2]. This approach applies random walk procedures to the graph. Such random walks can be considered to approximate the geodesic distance between nodes in the graph, although they do not necessarily follow the shortest path between nodes [3]. Graph Factorization (GF) [4] embeds the graph adjacency matrix by optimizing a weight regularization term whose coefficient controls the generalizability of the embeddings. HOPE [5] extends GF by preserving first-order and second-order proximity, factorizing a similarity matrix between nodes using an extended Singular Value Decomposition (SVD). Locally Linear Embedding (LLE) [6] assumes that each node is a linear combination of its neighbors in the embedding space. Laplacian Eigenmaps (LE) [7] keeps the embeddings of nodes close to each other if the edge weight between them is high. Global vectors for node representation [8] is another embedding technique that incorporates textual content associated with the nodes to learn meaningful representations of words and documents, and provides recommendations based on distances between words, documents, and graph embeddings built on Global Vectors for word representation (GloVe) [9].
None of the representation learning techniques mentioned above explicitly integrates geometric features or accounts for the underlying geometry of the vector space, although several studies, both theoretical and experimental, have explored the use of more intricate spaces for embedding purposes [10], [11]. Furthermore, embedding methods need to handle large, complex networks such as protein-protein interaction (PPI) networks and preserve their global properties. This paper proposes IsoGloVe, a graph representation method that addresses these challenges in node representation learning. IsoGloVe generates embeddings by normalizing and log-smoothing co-occurrence matrices of random walks and using geodesic distances between nodes to capture the intrinsic geometry of the network. This approach has several advantages: it considers the graph structure, preserves relationships between nodes, captures nuanced similarities, and is robust to noisy or missing nodes. The main contribution of this work is a mapping of complex networks to a vector space that preserves community structure and structural equivalence, using geodesic distances to compute the similarity between vectors. To our knowledge, this is the first time a count-based embedding method and a non-Euclidean metric have been used together for graph embeddings.

Proposed IsoGloVe Method
IsoGloVe is a method for generating dense representations of networks. This is accomplished by computing geodesic distances between nodes. The resulting embeddings can be used for various tasks such as node classification, link prediction, and community detection. IsoGloVe generates embedding vectors in three steps.
Step 1: IsoGloVe randomly generates a truncated random walk for each node using the edge list. The random walk algorithm preserves the local neighborhood of nodes by randomly choosing a node and performing a random walk of length L.
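A minimal Python sketch of this step is shown below, assuming an undirected NetworkX graph built from the edge list. The function name, the number of walks per node, and the file name are illustrative assumptions; the paper specifies only the walk length of 50.

```python
import random
import networkx as nx

def generate_walks(graph, walk_length=50, walks_per_node=10, seed=42):
    """Generate truncated random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph.nodes())
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:          # dead end: truncate the walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Example usage with an edge list file (file name is hypothetical)
G = nx.read_edgelist("ppi_edges.txt")
walks = generate_walks(G, walk_length=50)
```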
Step 2: IsoGloVe constructs a large matrix of node co-occurrences from the random walks. In this matrix, the entry in the i-th row and j-th column is the co-occurrence count of nodes i and j in the random walks. The co-occurrence count matrix is then factorized to yield a low-dimensional (D) matrix in which each row represents a node vector. The size of D depends on the size of the graph, and the high correlations between rows corresponding to nodes that are close in the graph motivate the factorization.
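A sketch of this counting step is given below, under the assumption that co-occurrence is counted within a fixed-size context window of each walk; the window size of 10 is an assumption (the paper reports a context size only for node2vec).

```python
import numpy as np

def cooccurrence_matrix(walks, window=10):
    """Count how often pairs of nodes co-occur within a context window."""
    nodes = sorted({n for walk in walks for n in walk})
    index = {n: i for i, n in enumerate(nodes)}
    X = np.zeros((len(nodes), len(nodes)))
    for walk in walks:
        for pos, node in enumerate(walk):
            left = max(0, pos - window)
            for ctx in walk[left:pos] + walk[pos + 1:pos + window + 1]:
                X[index[node], index[ctx]] += 1.0
    return X, index
```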
Step 3: The vector representation of each node is trained by minimizing the difference between the geodesic distance among the embeddings and the logarithm of their co-occurrence count. To accomplish this, the counts in the matrix are first normalized and smoothed using logarithms. Next, the geodesic distance between every pair of nodes is calculated. For training the vector representations, the same loss function as GloVe is adopted, but with the geodesic distance in place of the dot product. The IsoGloVe loss function is expressed as follows:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( d(n_i, \tilde{n}_j) + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where d(n_i, ñ_j) is the geodesic distance between the vector representations of node n_i and the context node ñ_j, for the i-th and j-th nodes in the vocabulary V, b_i and b̃_j are the corresponding biases, and X_ij is the co-occurrence probability between the i-th and j-th node.
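A hedged PyTorch sketch of this objective is given below. The distance_fn argument stands in for the geodesic distance d(n_i, ñ_j); a plain Euclidean norm is plugged in as a placeholder, which does not reproduce the Riemannian distance described later. The weighting function f and its parameters follow the GloVe defaults and are assumptions.

```python
import torch

def weighting(X, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights the contribution of very frequent node pairs
    return torch.clamp(X / x_max, max=1.0) ** alpha

def isoglove_loss(emb, ctx_emb, b, b_ctx, X, distance_fn):
    """Weighted least-squares loss over observed co-occurrences."""
    i, j = (X > 0).nonzero(as_tuple=True)          # observed node pairs only
    d = distance_fn(emb[i], ctx_emb[j])            # geodesic distance stand-in
    diff = d + b[i] + b_ctx[j] - torch.log(X[i, j])
    return (weighting(X[i, j]) * diff ** 2).sum()

# Placeholder distance: Euclidean norm between paired embedding rows
euclidean = lambda u, v: torch.norm(u - v, dim=1)
```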
The function f(X_ij) down-weights the contribution of frequent node pairs. A similarity between n_i and ñ_j in Euclidean space can be obtained by taking their inner product [12]. However, the dot product of two vectors is not always equivalent to their similarity in a geodesic space, and other measures of similarity may be more appropriate. The geodesic distance is the sum of edge weights along the shortest path between two nodes. The top N eigenvectors of the geodesic distance matrix represent the coordinates in the new N-dimensional Euclidean space. In this research, we adopt the Riemannian metric to compute the distance between two points within the embedding space. This metric incorporates the intrinsic curvature of the embedding space, leading to more effective modeling of non-linear relationships between nodes [13]. First, we compute the pairwise geodesic distances between all the nodes in the original high-dimensional space and use these distances to define a Riemannian metric. The Riemannian metric is then used to embed the nodes in a low-dimensional space such that the pairwise distances between the embedded nodes approximate the original geodesic distances as closely as possible. More specifically, given a Riemannian manifold (M, g), where M is the underlying space and g is a Riemannian metric tensor defined on M, the distance d(p, q) between two points p and q in M is given by the length of the shortest path connecting them, which is the Riemannian distance:

$$d(p, q) = \inf_{\gamma} L(\gamma),$$

where the infimum is taken over all piecewise smooth paths γ connecting p and q, and L(γ) is the length of the path γ, defined by:

$$L(\gamma) = \int_a^b \sqrt{g(\gamma'(t), \gamma'(t))}\, dt,$$

where a and b are the parameter values at the starting and ending points of the path γ, and g(·, ·) is the Riemannian metric tensor.
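As an illustration, the pairwise geodesic distances described above could be approximated as in the sketch below: edge weights along a k-nearest-neighbor graph are summed along shortest paths, in the spirit of Isomap. The value of k and the use of scikit-learn/SciPy are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distance_matrix(points, k=10):
    """Approximate pairwise geodesic distances via a kNN graph and Dijkstra."""
    knn = kneighbors_graph(points, n_neighbors=k, mode="distance")
    # method="D" runs Dijkstra; unreachable pairs are returned as np.inf
    return shortest_path(knn, method="D", directed=False)
```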
There are some challenges in accurately measuring geodesic distances. First, if the k-nearest neighbor graph is not connected, then the shortest path between some node pairs may not exist, and the value of d(w_i, w_j) would be undefined. In this case, the IsoGloVe loss function would become infinite. Therefore, it is important to use an algorithm such as Dijkstra's algorithm, which finds the shortest path between two nodes in a graph regardless of its connectivity. Second, when using Dijkstra's algorithm, the memory required to store the distance matrix and the eigenvectors, as well as the eigenvalue decomposition of the matrix, can be computationally expensive for large networks. The time complexity of computing the eigenvalue decomposition of a matrix scales as O(n^3), where n is the number of nodes in the graph, making it infeasible for large networks with tens of thousands or even millions of nodes. The IsoGloVe method uses the Lanczos algorithm [14] to perform the eigenvalue decomposition of the geodesic distance matrix in order to find node embeddings in a lower-dimensional space. This method is designed to capture the underlying structure of the network, resulting in compact and interpretable representations that preserve both the global and local relationships between nodes. The resulting embeddings can explain a significant portion of the variance in the original high-dimensional network.
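A sketch of the Lanczos-based decomposition is shown below. SciPy's eigsh (ARPACK, an implicitly restarted Lanczos method) is used as a stand-in for the Lanczos routine of [14], and the classical MDS-style double-centring of the squared distance matrix is an assumption about how the geodesic distance matrix is converted into an eigendecomposition problem.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lanczos_embedding(D, dim=100):
    """Low-dimensional node coordinates from a geodesic distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                    # classical MDS-style Gram matrix
    # eigsh computes only the top `dim` eigenpairs via the Lanczos method,
    # avoiding a full O(n^3) eigenvalue decomposition
    vals, vecs = eigsh(B, k=dim, which="LA")
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0.0))
```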
The overall computational complexity of IsoGloVe for a given network G(V, E), considering K nearest neighbors for each node, combines two terms: the cost of the nearest neighbor search and the shortest-path graph search used to find the geodesic distance between pairs of nodes in the graph, and O(V^2), the cost of learning d-dimensional embeddings from the co-occurrence matrix.

Experiments
We evaluate the proposed method on the PPI networks [15] summarized in Table 1. By representing PPI networks in geodesic space, researchers can use graph-based techniques to analyze the interactions between proteins and uncover the underlying biological mechanisms. We evaluate IsoGloVe through graph reconstruction, visualization, and node classification. For graph reconstruction, we reconstruct the graph by ranking node pairs according to the proximity of their representations. We then calculate the fraction of true links among the top k predictions as the reconstruction precision. In addition, since network embeddings capture the network structure, they can be helpful for node classification. Therefore, we assess the effectiveness of pre-trained IsoGloVe embeddings by using them as node features for classification.
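As an illustration of the graph-reconstruction protocol, a minimal sketch of precision@k is shown below. It assumes dense NumPy arrays for the embeddings and the binary adjacency matrix, and it ranks candidates by Euclidean proximity as a placeholder; the paper ranks by proximity in the learned geodesic space.

```python
import numpy as np

def reconstruction_precision_at_k(embeddings, adjacency, k=10):
    """Precision@k for graph reconstruction from embedding proximity."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                # exclude self-pairs
    precisions = []
    for i in range(adjacency.shape[0]):
        top_k = np.argsort(dists[i])[:k]           # k nearest candidates
        precisions.append(adjacency[i, top_k].sum() / k)
    return float(np.mean(precisions))
```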
The node features are evaluated with leave-one-out cross-validation. The classifiers used in this study are Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Gaussian Naïve Bayes (NB) [16]. The choice of classifier depends on the specific requirements of the problem and the characteristics of the data. These classifiers are commonly used in the evaluation of embeddings because they are well established and well understood, can be easily adapted to handle high-dimensional data, and have been shown to perform well on various graph analysis tasks. Moreover, the learned graph embeddings can be visualized to help us understand the network's topological features. IsoGloVe learns a 100-dimensional embedding, which is input to t-SNE [17] to reduce the dimensionality to two for visualizing the nodes in a 2-dimensional space. For visualizing the network, 25% of the nodes in the Yeast PPI network are randomly sampled. We implemented IsoGloVe with PyTorch 1.11, and the experiments were performed on two Intel(R) Xeon(R) X5660 CPUs @ 2.80GHz with 16GB of memory. The experiments involved different embedding methods, each with specific parameter settings: HOPE had an attenuation factor of 0.01; GF used a regularization parameter of one and a learning rate of 10^-4; node2vec and IsoGloVe both had a walk length of 50, while node2vec also had a context size of 10. The dimensionality of the pre-trained embeddings was set to 100 for all methods. However, LLE and LE have a high time complexity (O(|E|d^2)), and some eigenvectors did not converge for the Tissue PPI network, so node classification could not be performed in these cases.
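A sketch of the node-classification evaluation is given below, using scikit-learn with default hyperparameters (an assumption; the paper does not report classifier settings). Here `features` are the pre-trained embeddings and `labels` the node classes.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

CLASSIFIERS = {
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "NB": GaussianNB(),
}

def evaluate_node_classification(features, labels):
    """Score each classifier with leave-one-out cross-validation."""
    scores = {}
    for name, clf in CLASSIFIERS.items():
        cv_scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
        scores[name] = float(np.mean(cv_scores))
    return scores
```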

Results and Discussion
Table 2 shows the performance of the models on the three datasets. The Mean Average Precision (MAP) of graph reconstruction indicates that IsoGloVe outperforms all baseline models on all datasets. It achieved higher scores for all classifiers except SVM and KNN on the Yeast PPI network, and scores similar to the other baselines for SVM on the Human PPI network. However, SVM may not perform as well on this data because the distribution lies on a manifold, making it difficult to separate the classes using a linear boundary. On the other hand, KNN may not perform well because it relies on the local structure of the data; on a manifold, the local structure can change rapidly as the distance from a point increases. Moreover, the improvement over the baselines on the Tissue PPI network is more remarkable than the results for the two other PPI networks. This result can be explained by the different cellular complexity and size of the Tissue PPI network relative to the other PPI networks. The visualization results in Figure 2 for a sample network (the Yeast PPI network) reveal the properties that each embedding method can preserve. IsoGloVe, node2vec, and GloVe cluster structurally equivalent nodes together, while GF, LE, LLE, and HOPE preserve only the community structure and keep connected nodes close. Additionally, IsoGloVe is able to differentiate between high-degree hubs and central nodes in their communities, as demonstrated by the visualization and graph reconstruction results. We perform Kruskal-Wallis statistical tests on the IsoGloVe, node2vec, GF, and HOPE results across all datasets. The H statistic is H(3, N = 44) = 8.96 with a p-value of 0.029; the result is significant at p < .05.

Impact of Vector Dimension
The dimension of the vectors can affect their quality in several ways. A larger dimension provides more capacity to represent relationships between entities, but also leads to longer training times and potential overfitting, causing poor performance on new networks. Additionally, larger dimensions make embeddings harder to interpret and visualize, and the relationship between dimensions may become less clear. The optimal dimensionality depends on the specific application and on the amount and quality of the training data. Since IsoGloVe is an embedding method that captures relationships between nodes based on co-occurrence patterns in random walks, the specific dimensionality of the vectors used in IsoGloVe is not as important as with other methods, but changing the vector size can still affect the quality of the embeddings. Figure 1 illustrates the effect of embedding dimension on (a) node classification with Naive Bayes and (b) graph reconstruction in the Tissue PPI network. We make two observations. First, in both cases IsoGloVe is robust against changes in vector size; one reason is that with more parameters, the models overfit on the observed links and are unable to predict class labels. Second, the relative performance of the baseline methods depends on the embedding dimension. For graph reconstruction, node2vec outperforms the other methods at higher dimensions, whereas the embeddings generated by IsoGloVe achieve a higher model score at low dimensions.

Conclusion and Future Work
IsoGloVe is a new graph embedding technique based on the assumption that the data lie on a manifold. To calculate the embeddings, we estimate the neighbors of each node on the manifold. Our method maintains the similarities and dependencies between nodes by using a distinct metric in the vector space, leading to richer vector representations than the baselines. Our assessment of IsoGloVe employs PPI networks, which are less complex than Freebase networks. Future work will concentrate on evaluating the quality of the node embeddings generated by IsoGloVe on Freebase graphs.

Figure 1. Assessing different embedding models for the Tissue PPI network at different dimensions: (a) node classification with Naive Bayes, (b) graph reconstruction.
Figure 2. Visualization of node embeddings for a sample of the Yeast PPI network.

Table 1. Detailed information about the PPI networks used in this study.

Table 2. Performance comparisons (model score and MAP) on the three PPI networks.