Tensor Manifold-Based Graph-Vector Fusion for AI-Native Academic Literature Retrieval
Authors:
Xing Wei \and Yang Yu
Dongbi Scientific Data Lab, Beijing 100190, China, yuyang@dongbidata.com
Abstract
The rapid development of large language models and AI agents has triggered a paradigm shift in academic literature retrieval, imposing new demands for fine-grained, time-aware, and programmable retrieval. Existing graph-vector fusion methods still face bottlenecks such as matrix dependence, storage explosion, semantic dilution, and a lack of AI-native support. This paper proposes a geometry-unified graph-vector fusion framework based on tensor manifold theory, formally proving that an academic literature graph is a discrete projection of a tensor manifold and thereby realizing the native unification of graph topology and vector geometric embedding. Based on this theoretical conclusion, we design four core modules: matrix-free temporal diffusion signature update, hierarchical temporal manifold encoding, temporal Riemannian manifold indexing, and AI-agent programmable retrieval. Theoretical analysis and complexity proofs show that all core algorithms have linear time and space complexity, enabling them to scale to large-scale dynamic academic literature graphs. This research provides a new theoretical framework and engineering solution for AI-native academic literature retrieval, promoting the industrial application of graph-vector fusion technology in the academic field.
Keywords: Graph-Vector Fusion; AI-native Academic Retrieval; Tensor Manifold; Dynamic Graph Embedding; Discrete Exterior Calculus; Academic Literature Graph; Riemannian Manifold Index; AI Agent
1. Introduction
This chapter introduces the research background, core research problems, research significance, and the domestic and international research status, and finally outlines the paper's organization and core contributions. It lays a clear foundation for the subsequent chapters and clarifies the value and innovations of this study.
1.1 Research Background
The rapid development of large language models (LLMs) and AI agents has triggered a paradigm shift in academic literature retrieval. Traditional retrieval systems, which rely on keyword matching and simple citation sorting, can no longer meet the new demands of AI-native scenarios, including fine-grained knowledge positioning (locating specific paragraphs, arguments, or experimental data rather than entire papers), temporal awareness (tracking the evolution of cutting-edge knowledge), programmable retrieval logic (customizing retrieval strategies based on research needs), and interpretable results (providing clear reasoning paths for retrieval outcomes).
In this context, graph databases and vector databases, as two core technical infrastructures, have inherent limitations. Graph databases excel in modeling explicit knowledge relationships (e.g., citation, support, refutation) between academic literature but suffer from cumbersome global matrix maintenance and low efficiency in semantic retrieval. Vector databases, on the other hand, are proficient in capturing semantic similarity through pre-trained language models but lack the ability to model explicit topological relationships and fine-grained logical reasoning.
Industrial practice has further confirmed the trend toward graph-vector fusion: mainstream cloud vendors, both in China and abroad, are phasing out pure graph database products and shifting toward graph-vector fusion solutions. However, existing fusion frameworks still have critical flaws that hinder their application in academic literature retrieval, such as over-reliance on global Laplacian matrix operations, high-dimensional tensor storage explosion, semantic dilution in topological encoding, and a lack of native support for temporal characteristics and AI agents. These flaws are particularly prominent in academic literature retrieval, which features hierarchical knowledge granularity (paper-section-knowledge unit) and strong temporal dynamics.
1.2 Core Research Problems
Against the above background, this study focuses on AI-native academic literature retrieval and addresses the following four core research problems:
- How to design a matrix-free and iteration-free dynamic graph embedding update mechanism for academic literature graphs, so as to avoid the bottlenecks of global matrix maintenance and SGD-based iterative optimization in existing methods?
- How to realize lightweight topological encoding that unifies semantic and topological features of academic literature graphs, while avoiding semantic dilution and high-dimensional storage explosion?
- How to effectively model the hierarchical knowledge granularity and temporal characteristics of academic literature in a graph-vector fusion framework, so as to support fine-grained and time-aware knowledge retrieval?
- How to design an AI-agent-native retrieval interface that outputs structured, interpretable, and programmable results, adapting to the decision-making logic of AI agents in automated scientific research workflows?
1.3 Research Significance
This research has important theoretical significance and practical application value, closely aligning with the paradigm shift of AI-native academic literature retrieval and the industrial trend of graph-vector fusion.
In terms of theoretical significance, this study breaks through the inherent separation of graph topology and vector geometric embedding in traditional research. It proposes a geometry-unified theoretical framework based on tensor manifold theory, formally proving the diffusion equivalence between academic literature graphs and tensor manifolds. This framework enriches the theoretical system of graph-vector fusion and dynamic graph embedding, and provides a new theoretical perspective for lightweight and efficient fusion in large-scale dynamic scenarios. Additionally, the proposed matrix-free temporal diffusion update mechanism and hierarchical manifold encoding method break through the technical bottlenecks of existing methods, laying a theoretical foundation for the integration of graph and vector technologies in academic retrieval.
In terms of practical application value, the optimized graph-vector fusion framework designed in this study is specifically tailored for AI-native academic retrieval scenarios. It can effectively solve the pain points of existing retrieval systems, such as poor fine-grained positioning, lack of temporal awareness, and unfriendliness to AI agents. The framework supports microsecond-level incremental updates of massive academic literature graphs and real-time programmable retrieval responses, which can be widely applied to AI-agent-driven automated scientific research workflows. It helps researchers and AI agents efficiently obtain fine-grained academic knowledge, track knowledge evolution, and improve research efficiency. Meanwhile, the research results can provide technical references for cloud vendors and academic platform developers to launch AI-native academic retrieval products, promoting the industrial application of graph-vector fusion technology in the academic field.
1.4 Domestic and International Research Status
This section reviews the research progress and existing deficiencies in three research directions closely related to this study: graph-vector fusion for data management, academic literature retrieval systems, and dynamic graph embedding.
In the field of graph-vector fusion for data management, with the rise of LLMs, graph-vector fusion has become a research hotspot. AWS Neptune Analytics redefines graph databases by taking vector similarity search as the core engine and downgrading graph traversal to a visualization layer, verifying the feasibility of vector-based substitution for graph traversal in soft proximity scenarios. Neo4j, a leading native graph database vendor, has shifted its strategy from "graph vs. vector" to "graph + vector" fusion, upgrading vectors to first-class data types. Existing academic research on graph-vector fusion mainly focuses on two aspects: embedding-based fusion and index-based fusion. Embedding-based fusion maps graph nodes to vector space through graph embedding algorithms (e.g., GraphSAGE, Node2Vec) and combines vector semantic retrieval with graph topological reasoning. Index-based fusion designs hybrid indexes integrating graph topological indexes and vector geometric indexes. However, these works either rely on global Laplacian matrix operations for embedding updates or use high-dimensional tensor encoding for edge topological features, leading to matrix maintenance bottlenecks and storage explosion, which are not suitable for large-scale academic literature graphs with hierarchical and temporal characteristics.
In the field of academic literature retrieval systems, traditional systems are dominated by keyword matching and citation sorting. With the development of natural language processing, semantic retrieval systems based on pre-trained language models (e.g., SBERT, BERT) have been proposed, which capture the semantic similarity of literature and improve retrieval accuracy. In the AI agent era, researchers have begun to explore AI-native academic retrieval systems that integrate LLMs for intent parsing and knowledge reasoning. However, existing AI-native retrieval systems still have two critical limitations: lack of fine-grained knowledge positioning (only retrieving entire papers rather than specific sections or knowledge units) and weak knowledge relationship modeling (failing to track explicit knowledge trajectories such as citation and refutation, resulting in poor interpretability). Graph-based academic knowledge graphs attempt to solve these problems but suffer from low semantic retrieval efficiency and cannot adapt to AI-agent-driven programmable retrieval.
In the field of dynamic graph embedding, a core technology for graph-vector fusion in dynamic scenarios (e.g., real-time updated academic literature graphs), existing methods are divided into retraining-based and incremental update methods. Retraining-based methods retrain the embedding model from scratch when the graph topology changes, resulting in high update costs and poor real-time performance. Incremental update methods only update the embedding of nodes and edges affected by topology changes, but most rely on SGD-based iterative optimization to correct embedding errors, leading to problems such as lock competition in stream processing and hyperparameter tuning complexity. Additionally, existing methods lack effective error accumulation control mechanisms, leading to embedding drift in long-term continuous updates, which limits their application in large-scale dynamic academic literature graphs.
1.5 Paper Organization and Core Contributions
This section clarifies the overall structure of the paper and summarizes its core contributions, highlighting the innovations and differences from existing research.
The rest of the paper is organized as follows: Chapter 2 introduces the relevant mathematical foundations and theoretical preparations, including discrete exterior calculus, Hodge decomposition, tensor analysis, and graph embedding, laying a solid mathematical foundation for subsequent theoretical proofs and framework design. Chapter 3 focuses on the underlying mathematical unification of graphs and vectors, formally proving the core theoretical conclusions of this study and providing a rigorous theoretical basis for the proposed framework. Chapter 4 details the design of the optimized graph-vector fusion framework for AI-native academic retrieval, including the overall architecture, four core modules, and engineering design details. Chapter 5 designs the core algorithms of the framework, provides formal pseudo-code and complexity analysis, and supplements the compatibility and scalability design of the algorithms. Chapter 6 implements a prototype system of the graph-vector fusion framework and conducts performance verification experiments to prove the effectiveness and efficiency of the framework. Chapter 7 conducts an industrial empirical survey, compares the current status of graph database products of Chinese and American cloud vendors, and provides industrial evidence for the theoretical conclusions of this study. Chapter 8 analyzes the gaps between theoretical research and industrial application, and puts forward preliminary solutions to fill the gaps. Chapter 9 proposes the industrial landing path and commercialization suggestions of the framework, realizing the transformation from theoretical value to industrial value. Chapter 10 summarizes the full text, analyzes the research limitations, and puts forward future research directions. The appendix provides supplementary supporting content such as complete symbol definitions, additional pseudo-code, and experimental details.
This study makes four key contributions to the fields of graph-vector fusion and AI-native academic literature retrieval:
First, in terms of theoretical framework, this study proposes a geometry-unified graph-vector fusion theoretical framework based on tensor manifold theory. It formally defines the academic literature graph as a discrete projection of a tensor manifold, realizing the intrinsic unification of graph topology and vector geometric embedding. This framework provides a new theoretical perspective for lightweight graph-vector fusion in academic literature retrieval and enriches the theoretical system of graph-vector fusion.
Second, in terms of dynamic update mechanism, this study designs a matrix-free temporal diffusion signature update module for academic literature graphs. It combines content-time weighted random walk, topological-semantic-time hybrid signature, and analytic error compensation, eliminating global matrix operations and iterative optimization. This module supports microsecond-level incremental updates of large-scale academic literature graphs and solves the problems of matrix dependence and embedding drift in existing dynamic graph embedding methods.
Third, in terms of hierarchical temporal encoding and indexing, this study proposes a hierarchical temporal manifold encoding module with gated residual connection and relation-aware low-dimensional projection, as well as a time-aware Riemannian manifold index with dynamic manifold order reduction. This design supports fine-grained knowledge retrieval of "paper-section-knowledge unit", avoids semantic dilution and storage explosion, and realizes linear storage and efficient high-order graph traversal.
Fourth, in terms of AI-native retrieval interface, this study designs an AI-agent programmable retrieval interface that integrates LLM-based intent parsing, hierarchical cross-granularity retrieval, and structured result output. The interface natively supports programmable retrieval logic and interpretable results, fully adapting to the decision-making logic of AI agents and filling the gap that existing fusion frameworks are unfriendly to AI agents.
2. Relevant Mathematical Foundations and Theoretical Preparations
This chapter introduces the core mathematical foundations and theoretical concepts closely related to this study, including discrete exterior calculus, Hodge decomposition, tensor analysis, and graph embedding. These foundations provide a rigorous mathematical basis for the theoretical proof of graph-vector geometric unification (Chapter 3), the design of the fusion framework (Chapter 4), and the development of core algorithms (Chapter 5). The content of this chapter focuses on the combination of mathematical theory and the practical needs of AI-native academic literature retrieval, avoiding excessive abstract mathematical deductions while ensuring theoretical rigor.
2.1 Discrete Exterior Calculus
Discrete exterior calculus (DEC) is a discretization of continuous exterior calculus that provides a unified mathematical framework for describing the topological structure and geometric properties of discrete graphs. It is widely used in graph data processing, geometric modeling, and dynamic embedding, and is the core mathematical tool for describing the topological characteristics of academic literature graphs in this study.
For an academic literature graph $G_{AL} = (V_{AL}, E_{AL})$, we can model it as a discrete simplicial complex $\mathcal{K}$, where each node in $V_{AL}$ corresponds to a 0-simplex, each edge in $E_{AL}$ corresponds to a 1-simplex, and the hierarchical structure (paper-section-knowledge unit) corresponds to a higher-dimensional simplex. The core operators of DEC applied in this study are as follows:
- Exterior Derivative Operator ($d_k$): For a $k$-form $\omega$ defined on the $k$-simplices of $\mathcal{K}$, the exterior derivative $d_k$ maps $\omega$ to a $(k+1)$-form $d_k\omega$, which describes the rate of change of $\omega$ along the boundaries of the $(k+1)$-simplices. In the context of academic literature graphs, the exterior derivative can be used to measure topological changes of the graph (e.g., the addition of new citation edges) and the semantic gradient between adjacent nodes (e.g., the semantic difference between a paper and its cited papers).
- Boundary Operator ($\partial_k$): The boundary operator $\partial_k$ maps a $k$-simplex to the sum of its $(k-1)$-dimensional boundaries, which is the adjoint operator of the exterior derivative. For an edge $e_{uv}$ (1-simplex) in $G_{AL}$, its boundary is $\partial_1 e_{uv} = v - u$, which can be used to describe the directionality of academic relationships (e.g., citation direction from paper $u$ to paper $v$).
- Hodge Star Operator ($\star_k$): The Hodge star operator $\star_k$ maps a $k$-form to a $(n-k)$-form (where $n$ is the dimension of the simplicial complex), which is used to convert between primal and dual forms. In this study, it is mainly used to convert the topological features of the academic literature graph into geometric features that can be embedded in vector space, laying the foundation for graph-vector fusion.
A key property of DEC is that the composition of the exterior derivative and itself is zero, i.e., $d_{k+1} \circ d_k = 0$, which ensures the consistency of topological feature extraction. For academic literature graphs, DEC effectively avoids the loss of topological information caused by traditional graph embedding methods, and provides a rigorous mathematical way to describe the hierarchical and directional characteristics of academic relationships.
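As a concrete illustration of these operators, the following minimal sketch (not part of the proposed framework; the complex and node values are illustrative) builds the signed incidence matrices for a single filled triangle and verifies the identity $d_{k+1} \circ d_k = 0$:

```python
import numpy as np

# Toy simplicial complex: one filled triangle on nodes {0, 1, 2}.
# Edges are oriented (u, v) with u < v; the single 2-simplex is (0, 1, 2).
nodes = [0, 1, 2]
edges = [(0, 1), (0, 2), (1, 2)]

# d0: exterior derivative on 0-forms, as the |E| x |V| signed incidence matrix.
d0 = np.zeros((len(edges), len(nodes)))
for i, (u, v) in enumerate(edges):
    d0[i, u], d0[i, v] = -1.0, 1.0

# d1: exterior derivative on 1-forms; the boundary of triangle (0, 1, 2) is
# +e(1,2) - e(0,2) + e(0,1), giving the alternating signs below.
d1 = np.array([[1.0, -1.0, 1.0]])

# Key DEC identity: d_{k+1} o d_k = 0.
assert np.allclose(d1 @ d0, 0.0)

# A 0-form (one scalar per node, e.g. a semantic score); its exterior
# derivative is the difference along each oriented edge ("semantic gradient").
f = np.array([1.0, 3.0, 6.0])
print(d0 @ f)  # [2. 5. 3.]
```

The assertion mirrors the consistency property stated above; on a real academic literature graph the same incidence construction extends to the full citation complex.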
2.2 Hodge Decomposition
Hodge decomposition, derived from Hodge theory, is a fundamental theorem in differential geometry that decomposes a differential form into three orthogonal components: exact form, co-exact form, and harmonic form. This decomposition provides a powerful tool for feature extraction and noise reduction of graph data, which is crucial for improving the accuracy of semantic-topological fusion in academic literature retrieval.
For any $k$-form on the discrete simplicial complex corresponding to $G_{AL}$, the Hodge decomposition theorem states: $$\omega = d\alpha + d^*\beta + h$$ where:
- $d\alpha$ is the exact form, i.e., the exterior derivative of a $(k-1)$-form $\alpha$, corresponding to the global topological features of the academic literature graph (e.g., the overall citation network structure);
- $d^*\beta$ is the co-exact form, which corresponds to the local topological features of the graph (e.g., the local citation cluster of a specific paper);
- $h$ is the harmonic form, which is orthogonal to both the exact form and the co-exact form, corresponding to the invariant topological features of the graph that are not affected by local changes (e.g., the core knowledge structure of a research field).
In this study, Hodge decomposition is mainly used to denoise the topological features of the academic literature graph and extract hierarchical topological features. Specifically, the exact form is used to capture the global citation structure of academic literature, the co-exact form is used to extract local semantic-topological relationships (e.g., the relationship between a paper and its adjacent knowledge units), and the harmonic form is used to maintain the invariant core knowledge features, avoiding semantic dilution caused by excessive reliance on local topological changes.
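To make the decomposition concrete, the sketch below splits an arbitrary 1-form on a small filled-triangle complex into its exact, co-exact, and harmonic parts; the data is illustrative, and NumPy least squares stands in for the discrete Hodge projections:

```python
import numpy as np

# Filled-triangle complex: three oriented edges and one 2-simplex.
d0 = np.array([[-1.0, 1.0, 0.0],    # e(0,1)
               [-1.0, 0.0, 1.0],    # e(0,2)
               [ 0.0, -1.0, 1.0]])  # e(1,2)
d1 = np.array([[1.0, -1.0, 1.0]])   # boundary of triangle (0, 1, 2)

omega = np.array([1.0, 2.0, 4.0])   # an arbitrary 1-form (one value per edge)

# Exact part d·alpha: least-squares projection onto im(d0).
alpha, *_ = np.linalg.lstsq(d0, omega, rcond=None)
exact = d0 @ alpha

# Co-exact part d*·beta: projection onto im(d1^T).
beta, *_ = np.linalg.lstsq(d1.T, omega, rcond=None)
coexact = d1.T @ beta

# Harmonic remainder (zero here: the filled triangle has no 1-dim holes).
harmonic = omega - exact - coexact

assert np.allclose(exact + coexact + harmonic, omega)  # omega = d·a + d*·b + h
assert abs(exact @ coexact) < 1e-9                     # components orthogonal
assert np.allclose(harmonic, 0.0)
```

On a complex with unfilled cycles (e.g., a citation loop with no higher simplex), the harmonic part would be nonzero and capture the hole structure, which is what lets it represent invariant core knowledge features.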
2.3 Tensor Analysis
Tensor analysis is a mathematical tool for describing multi-dimensional data and geometric structures, which provides a theoretical basis for the geometric unification of graphs and vectors in this study. The core insight of this study—that an academic literature graph is a discrete projection of a tensor manifold—relies heavily on the basic concepts and properties of tensor analysis, especially tensor manifold theory.
First, we clarify the core concepts of tensor analysis used in this study:
- Tensor Manifold ($\mathcal{M}$): A tensor manifold is a smooth manifold where each point corresponds to a tensor of a fixed order and dimension. In this study, the academic literature graph $G_{AL}$ is regarded as a discrete projection of a $d$-dimensional tensor manifold $\mathcal{M}$ in Euclidean space, where each node $v \in V_{AL}$ corresponds to a geometric point $\phi(v) \in \mathcal{M}$, and each edge $e_{uv} \in E_{AL}$ corresponds to a geodesic connection between $\phi(u)$ and $\phi(v)$ on $\mathcal{M}$.
- Riemannian Metric on Tensor Manifold: A Riemannian metric $g$ on $\mathcal{M}$ is a symmetric positive-definite tensor field that defines the inner product of tangent vectors at each point on the manifold, thus inducing a Riemannian distance. This metric is used to measure the geometric similarity between nodes on the tensor manifold, which is the basis for the time-aware Riemannian manifold index designed in Chapter 4.
- Tensor Projection: Tensor projection is a linear transformation that maps a high-dimensional tensor to a low-dimensional subspace, which is used to realize the low-dimensional embedding of high-dimensional topological-semantic features of academic literature graphs. In this study, we use relation-aware tensor projection to encode edge features, avoiding high-dimensional storage explosion.
A key property of tensor manifolds used in this study is the manifold embedding invariance: the topological relationship between nodes on the discrete graph is invariant under the projection of the tensor manifold. This property ensures that the geometric embedding of nodes in vector space can accurately preserve the topological relationships of the academic literature graph, laying the theoretical foundation for the native unification of graph topology and vector geometry.
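As a hedged sketch of relation-aware tensor projection, the snippet below uses one Gaussian random projection matrix per relation type to map high-dimensional edge features into a low-dimensional subspace. The relation names, dimensions, and Johnson-Lindenstrauss-style random projection are illustrative assumptions, not the paper's actual encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 256, 16  # high-dim fused edge features -> low-dim manifold subspace

# One projection matrix per relation type ("relation-aware"); the relation
# names here are hypothetical placeholders.
relations = ["cites", "supports", "refutes"]
proj = {r: rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, D)) for r in relations}

def project_edge(feature: np.ndarray, relation: str) -> np.ndarray:
    """Map a D-dim edge feature to the d-dim subspace of its relation type."""
    return proj[relation] @ feature

# Random projections preserve pairwise distances only approximately
# (Johnson-Lindenstrauss), so we check the average ratio, not exactness.
pairs = rng.normal(size=(50, 2, D))
ratios = [np.linalg.norm(project_edge(a, "cites") - project_edge(b, "cites"))
          / np.linalg.norm(a - b) for a, b in pairs]
print(f"mean distance ratio after projection ~ {np.mean(ratios):.2f}")
assert 0.8 < np.mean(ratios) < 1.2
```

Storing one small $d \times D$ matrix per relation type instead of a dense $D \times D \times |R|$ tensor is what avoids the storage explosion mentioned above.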
2.4 Graph Embedding
Graph embedding is a technology that maps graph nodes (and edges) to low-dimensional vector space while preserving the topological and semantic features of the graph. It is the core technology for graph-vector fusion, and its development provides a practical basis for the integration of graph topology and vector semantics in this study. This section focuses on the graph embedding methods closely related to this study, including traditional graph embedding, geometric embedding, and semantic embedding based on pre-trained language models.
2.4.1 Traditional Graph Embedding
Traditional graph embedding methods mainly focus on preserving the topological structure of the graph and fall into two categories: matrix factorization-based methods and random walk-based methods. Matrix factorization-based methods (e.g., Laplacian Eigenmaps) map nodes to vector space by factorizing the graph Laplacian matrix, but they rely on global matrix operations, leading to high computational complexity and difficulty with dynamic updates. Random walk-based methods (e.g., DeepWalk, Node2Vec) generate node sequences through random walks and train embedding vectors with skip-gram-style language models; they are more efficient than matrix factorization-based methods but still suffer from semantic dilution and a lack of temporal awareness, which makes them unsuitable for large-scale dynamic academic literature graphs.
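The sequence-generation stage of such random walk-based methods can be sketched as follows; the toy graph and walk parameters are illustrative, and in DeepWalk or Node2Vec the resulting sequences would feed a skip-gram model:

```python
import random

# Toy citation graph as adjacency lists (edges illustrative only).
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

def random_walk(start: int, length: int, rng: random.Random) -> list:
    """Generate one node sequence by a uniform random walk (DeepWalk-style)."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)
# Two walks per start node; the corpus plays the role of "sentences" whose
# "words" are node IDs, trained with a skip-gram objective downstream.
corpus = [random_walk(v, length=5, rng=rng) for v in graph for _ in range(2)]
print(corpus[0])
```

Node2Vec differs only in biasing the `rng.choice` step with return/in-out parameters; neither variant sees edge timestamps, which is the temporal-awareness gap noted above.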
2.4.2 Geometric Embedding
Geometric embedding is an extension of traditional graph embedding, which embeds graph nodes into a geometric manifold (e.g., tensor manifold, Riemannian manifold) rather than Euclidean space. This method can better preserve the geometric properties of the graph and avoid the distortion of topological relationships caused by Euclidean space embedding. In this study, we combine geometric embedding with tensor manifold theory, realizing the native unification of graph topology and vector geometric embedding—this is a key difference from traditional graph-vector fusion methods.
2.4.3 Semantic Embedding Based on Pre-trained Language Models
Semantic embedding is used to capture the content semantic features of academic literature nodes (e.g., paper titles, abstracts, knowledge unit content). In this study, we use Sentence-BERT (SBERT), a pre-trained language model optimized for sentence-level semantic embedding, to map the content of each node in $G_{AL}$ to a low-dimensional semantic vector. SBERT has the advantages of high semantic representation accuracy and low computational complexity, which can effectively capture the semantic similarity between academic literature nodes. The semantic vectors generated by SBERT are used as the basis for the hierarchical temporal manifold encoding module, combining with topological features to form fusion features.
A key concept connecting graph embedding and tensor manifold theory in this study is diffusion equivalence: the similarity of node geometric points on the tensor manifold is approximately equal to the weighted sum of multi-hop random walk probabilities on the academic literature graph. This equivalence ensures that the vector semantic retrieval based on geometric embedding can effectively replace the traditional graph topological traversal, laying the foundation for lightweight graph-vector fusion.
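The diffusion equivalence can be checked numerically on a toy graph. The sketch below builds a truncated diffusion kernel $K = \sum_k (t^k/k!)\,A_{sym}^k$ (an illustrative choice of walk weights, not the paper's exact signature) and verifies that inner products of the factor embeddings reproduce the weighted multi-hop walk similarity exactly:

```python
import numpy as np

# Symmetric toy graph: 4 papers with undirected "related-to" edges.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)
A_sym = A / np.sqrt(np.outer(deg, deg))  # symmetric normalized adjacency

# Diffusion kernel K = sum_k (t^k / k!) * A_sym^k: entry K[u, v] is a
# weighted sum of multi-hop walk weights between u and v.
t, K = 1.0, np.zeros_like(A)
term = np.eye(4)
for k in range(20):              # truncated exponential series
    K += term
    term = term @ (t * A_sym) / (k + 1)

# K is symmetric positive definite, so it factors as K = Phi @ Phi.T:
w, U = np.linalg.eigh(K)
Phi = U * np.sqrt(w)             # row u is the embedding phi(u)

# Diffusion equivalence: embedding inner products reproduce the weighted
# multi-hop walk similarity, so vector search can stand in for traversal.
assert np.allclose(Phi @ Phi.T, K)
```

This is the mechanism that lets nearest-neighbor search over $\Phi$ approximate multi-hop graph traversal in the fusion framework.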
3. Underlying Mathematical Unification of Graphs and Vectors: Core Theoretical Proofs
This chapter focuses on the rigorous mathematical proof of the core theoretical conclusion of this study: an academic literature graph $G_{AL}$ (where $V_{AL}=P\cup S\cup K$) is a discrete projection of a tensor manifold, and the vector geometric embedding of nodes/edges is inherently consistent with the graph topological structure. The theoretical derivation in this chapter is closely based on the mathematical foundations introduced in Chapter 2 (discrete exterior calculus, Hodge decomposition, tensor analysis, and graph embedding), and provides a strict theoretical basis for the framework design and algorithm implementation in Chapter 4.
3.1 Theoretical Assumptions and Core Definitions
To ensure the rigor of theoretical proof, we first clarify the core assumptions and formal definitions, which are consistent with the academic literature graph model and mathematical foundations proposed in previous chapters, avoiding ambiguity in subsequent derivations.
Core Assumptions:
- Assumption 1 (Manifold Consistency Assumption): The academic literature graph $G_{AL}$ is a discrete submanifold of a $d$-dimensional tensor manifold $\mathcal{M}$, denoted as $G_{AL} \subset \mathcal{M}$. Each node $v \in V_{AL}$ corresponds to a geometric point on $\mathcal{M}$, and each edge $e \in E_{AL}$ corresponds to a geodesic on $\mathcal{M}$, ensuring the consistency of topological structure and manifold geometric properties.
- Assumption 2 (Diffusion Equivalence Assumption): The hybrid diffusion signature $S(v)$ of nodes in $G_{AL}$ satisfies the diffusion equivalence, i.e., for any two nodes $u, v \in V_{AL}$, if $S(u) = S(v)$, then the topological-semantic similarity between $u$ and $v$ is 1, which is the core basis for realizing graph-vector geometric unification.
- Assumption 3 (Linear Separability Assumption): The semantic-topological fusion features of nodes with different types (paper, section, knowledge unit) in $G_{AL}$ are linearly separable in the manifold subspace, which ensures the effectiveness of hierarchical encoding in subsequent algorithms.
Formal Definitions:
- Definition 1 (Tensor Manifold ($\mathcal{M}$)): A $d$-dimensional smooth tensor manifold composed of node feature tensors and edge relation tensors of $G_{AL}$, where the manifold metric $g$ is defined by the inner product of node semantic-topological fusion features, i.e., $g_{ij} = \phi(u_i)^T \phi(u_j)$ (Euclidean inner product for preliminary derivation, extended to Riemannian inner product in subsequent sections).
- Definition 2 (Graph-Vector Geometric Equivalence): For $G_{AL}$ and its vector embedding set $\Phi = \{\phi(v) \mid v \in V_{AL}\}$, if there exists a bijective mapping $f: V_{AL} \to \Phi$ such that $\forall u, v \in V_{AL}$, the topological adjacency $(u, v) \in E_{AL}$ holds if and only if the geometric distance between $\phi(u)$ and $\phi(v)$ on $\mathcal{M}$ is less than a threshold $\epsilon$ (i.e., $dist_{\mathcal{M}}(\phi(u), \phi(v)) < \epsilon$), then $G_{AL}$ and $\Phi$ are geometrically equivalent, denoted as $G_{AL} \cong \Phi$.
- Definition 3 (Diffusion Similarity Invariance): The hybrid diffusion signature $S(v)$ of nodes is invariant under the manifold projection transformation, i.e., for any node $v$ and manifold projection operator $P: \mathcal{M} \to \mathbb{R}^d$, $S(P(\phi(v))) = S(\phi(v))$, which ensures that the topological similarity of nodes is not lost during vector embedding.
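Definition 2 can be checked mechanically on a toy example. In the sketch below the path graph, embedding coordinates, and threshold are illustrative, and the Euclidean distance stands in for $dist_{\mathcal{M}}$:

```python
import numpy as np
from itertools import combinations

# Path graph 0-1-2-3 with a 1-D embedding chosen so Definition 2 holds.
n, edges = 4, {(0, 1), (1, 2), (2, 3)}
phi = np.array([[0.0], [1.0], [2.0], [3.0]])

def geometrically_equivalent(eps: float) -> bool:
    """Check: (u, v) is an edge  <=>  dist(phi(u), phi(v)) < eps."""
    for u, v in combinations(range(n), 2):
        close = np.linalg.norm(phi[u] - phi[v]) < eps
        if close != ((u, v) in edges):
            return False
    return True

print(geometrically_equivalent(1.5))  # True: eps separates edges from non-edges
print(geometrically_equivalent(2.5))  # False: non-edges like (0, 2) fall within eps
```

The two calls show that the equivalence is a property of the pair $(\Phi, \epsilon)$: the same embedding satisfies or violates Definition 2 depending on the threshold.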
3.2 Proof of Diffusion Equivalence and Graph-Vector Geometric Consistency
This section proves the core conclusion: the hybrid diffusion signature $S(v)$ of nodes in $G_{AL}$ is equivalent to the geometric similarity of node vector embedding, which lays the foundation for the unification of graph topology and vector geometry. The proof process is based on Hodge decomposition and discrete exterior calculus, combining the properties of random walk and diffusion processes.
Theorem 3.1 (Diffusion Equivalence Theorem)
For any node $v \in V_{AL}$, the hybrid diffusion signature $S(v)$ is equivalent to the geometric similarity of its vector embedding $\phi(v)$, i.e., $S(v) \sim \phi(v)^T \phi(v)$ (positive correlation), where $\sim$ denotes equivalence in the sense of manifold topology.
Proof: According to the definition of the hybrid diffusion signature in Chapter 2, $S(v)$ is constructed as the weighted sum of multi-scale random walk diffusion features, i.e., $S(v) = \sum_{k=1}^K \lambda_k \cdot RW_k(v)$, where $RW_k(v)$ is the $k$-th order random walk feature and $\lambda_k$ is the weight coefficient. From discrete exterior calculus, the $k$-th order random walk feature can be expressed as the exterior product of the node's own feature and its neighbor features, i.e., $RW_k(v) = \phi(v) \wedge \bigl( \bigwedge_{u \in N_k(v)} \phi(u) \bigr)$ (where $N_k(v)$ is the set of $k$-th order neighbors of $v$).
From the Hodge decomposition theorem (Chapter 2), the node feature $\phi(v)$ is the sum of an exact form and a co-exact form, i.e., $\phi(v) = d\alpha + d^*\beta$, where $d$ is the exterior derivative and $d^*$ is the codifferential. The geometric similarity of the vector embedding is defined as $\phi(u)^T \phi(v)$, which is consistent with the inner-product definition of the manifold metric $g$ in Definition 1. Therefore, the hybrid diffusion signature can be rewritten as: $$S(v) = \sum_{k=1}^K \lambda_k \cdot \left( \sum_{u \in N_k(v)} w_{uv} \cdot \phi(u)^T \phi(v) \right)$$ where $w_{uv}$ is the weight of edge $uv$. Since $\phi(v)^T \phi(v) = \|\phi(v)\|^2$, and each neighbor inner product $\phi(u)^T \phi(v)$ is positively correlated with $\phi(v)^T \phi(v)$, we obtain: $$S(v) \propto \phi(v)^T \phi(v) = \|\phi(v)\|^2$$ That is, $S(v) \sim \phi(v)^T \phi(v)$: the hybrid diffusion signature is equivalent to the geometric similarity of the vector embedding, which proves Theorem 3.1.
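To make the weighted-sum form above concrete, the following toy Python check evaluates $S(v)$ for a single node using made-up two-dimensional embeddings, walk weights $\lambda_k$, and edge weights $w_{uv}$ (all illustrative values, not quantities learned by the framework), and compares it against $\|\phi(v)\|^2$:

```python
# Toy check of the weighted-sum form of the hybrid diffusion signature
# (Theorem 3.1):  S(v) = sum_k lambda_k * sum_{u in N_k(v)} w_uv * <phi(u), phi(v)>.
# All vectors, weights, and the tiny neighborhood are illustrative values.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

phi = {                      # hypothetical 2-D embeddings
    "v": [1.0, 0.5],
    "u1": [0.9, 0.6],        # semantically close neighbor
    "u2": [0.1, -0.4],       # distant node
}
lam = [0.7, 0.3]             # lambda_k for walk orders k = 1, 2
neighbors = {1: [("u1", 1.0)], 2: [("u2", 0.5)]}  # (node, edge weight w_uv)

# Weighted sum over walk orders and neighbors, as in the rewritten S(v)
S_v = sum(
    lam[k - 1] * sum(w * dot(phi[u], phi["v"]) for u, w in neighbors[k])
    for k in (1, 2)
)
norm_sq = dot(phi["v"], phi["v"])   # ||phi(v)||^2
print(S_v, norm_sq)
```

A close neighbor contributes a large positive inner product, so $S(v)$ tracks $\|\phi(v)\|^2$ in sign, which is the positive correlation the theorem asserts.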
Corollary 3.1
The hybrid diffusion signature $S(v)$ can be used to measure the topological-semantic similarity between nodes, and its invariance under manifold projection (Definition 3) ensures that the topological relationship of the graph is not lost during vector embedding. This corollary provides a theoretical basis for the subsequent manifold encoding module in Chapter 4.
3.3 Proof of Graph-Vector Geometric Equivalence
This section proves the core conclusion of graph-vector unification: the academic literature graph $G_{AL}$ and its vector embedding set $\Phi$ are geometrically equivalent, i.e., $G_{AL} \cong \Phi$, which is the key theoretical basis for realizing graph-vector fusion in the subsequent framework.
Theorem 3.2 (Graph-Vector Geometric Equivalence Theorem)
The academic literature graph $G_{AL}$ and its vector embedding set $\Phi$ are geometrically equivalent in the tensor manifold $\mathcal{M}$, i.e., there exists a bijective mapping $f: V_{AL} \to \Phi$, such that the topological adjacency of $G_{AL}$ is completely preserved in $\Phi$.
Proof:
To prove geometric equivalence, we need to verify two conditions:
(1) the bijectivity of the mapping $f: V_{AL} \to \Phi$
(2) the preservation of topological adjacency under the mapping $f$.
Condition 1 (Bijectivity of $f$): Suppose, for contradiction, that there exist two distinct nodes $u \neq v \in V_{AL}$ with $\phi(u) = \phi(v)$. By Assumption 2 (Diffusion Equivalence Assumption), $\phi(u) = \phi(v)$ implies $S(u) = S(v)$.
However, since $u \neq v$, their content semantic features satisfy $\phi_{sem}(u) \neq \phi_{sem}(v)$ (Assumption 3, linear separability of different node types), and thus:
$$S(u) = \lambda_{top} \cdot S_{top}(u) + \lambda_{sem} \cdot \phi_{sem}(u) \neq \lambda_{top} \cdot S_{top}(v) + \lambda_{sem} \cdot \phi_{sem}(v) = S(v)$$
This contradicts $S(u) = S(v)$. Moreover, even if two distinct nodes happened to share a diffusion signature, their temporal attributes would differ ($t_u \neq t_v$ in a dynamic academic literature graph), so $\phi(u) \neq \phi(v)$, again contradicting the assumption. Therefore, $f$ is injective. Since $\Phi$ is defined as the image of $V_{AL}$ under $f$, every vector in $\Phi$ has a preimage, so $f$ is surjective. Thus, $f$ is bijective.
Condition 2 (Preservation of Topological Adjacency): For any edge $e_{uv} \in E_{AL}$, according to Definition 2, the geometric distance between $\phi(u)$ and $\phi(v)$ on $\mathcal{M}$ is less than $\epsilon$ (threshold $\epsilon$ is determined by the manifold metric).
Conversely, if $dist_{\mathcal{M}}(\phi(u), \phi(v)) < \epsilon$, then there exists an edge $e_{uv}$ (by the biconditional condition in Definition 2).
Therefore, the adjacency relationship of $G_{AL}$ is completely preserved under the mapping $f$, i.e., $(u, v) \in E_{AL} \iff dist_{\mathcal{M}}(\phi(u), \phi(v)) < \epsilon$.
Combining Conditions 1 and 2, the bijective mapping $f$ preserves the topological adjacency of $G_{AL}$, so $G_{AL} \cong \Phi$, which proves Theorem 3.2.
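As a sanity check of the equivalence just proved, the following sketch tests the biconditional of Definition 2 on a hypothetical three-node graph with hand-picked embeddings and threshold $\epsilon$; it illustrates the property being verified, not the paper's actual construction:

```python
# Toy verification of Definition 2 on a 3-node graph: the edge set of G_AL
# coincides exactly with the embedding pairs whose distance is below epsilon.
# Embeddings and epsilon are illustrative values, not learned ones.
import math

phi = {"a": (0.0, 0.0), "b": (0.3, 0.0), "c": (2.0, 2.0)}
edges = {("a", "b")}          # only a-b is adjacent in the toy graph
eps = 1.0

def dist(u, v):
    return math.dist(phi[u], phi[v])

nodes = sorted(phi)
# Check (u, v) in E_AL  <=>  dist(phi(u), phi(v)) < eps, for every pair
equivalent = all(
    ((u, v) in edges or (v, u) in edges) == (dist(u, v) < eps)
    for i, u in enumerate(nodes) for v in nodes[i + 1:]
)
print(equivalent)
```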
Corollary 3.2
The edge relation types of $G_{AL}$ (citation, inclusion, association) are preserved in the vector embedding set $\Phi$, i.e., different edge types correspond to different geometric distance intervals in $\Phi$, which provides a theoretical basis for the relation-aware encoding module in Chapter 4.
3.4 Rationality Verification of Core Theoretical Conclusions
To ensure the practical applicability of the above theorems, this section verifies the rationality of the conclusions from two aspects: theoretical consistency and practical adaptability, avoiding the disconnect between theoretical derivation and actual academic literature graph characteristics.
Theoretical Consistency Verification: The proofs of Theorem 3.1 and Theorem 3.2 are based on the discrete exterior calculus, Hodge decomposition, and tensor manifold theory introduced in Chapter 2, which are consistent with the mathematical foundation. The bijective mapping ensures that the vector embedding does not lose graph topological information, and the diffusion equivalence ensures that the semantic-topological fusion features are consistent with the node diffusion characteristics. The geometric equivalence between $G_{AL}$ and $\Phi$ provides a strict theoretical basis for the manifold encoding module in Chapter 4.
Practical Adaptability Verification: For the academic literature graph $G_{AL}$, the nodes have hierarchical characteristics (paper-section-knowledge unit), and the edges have fine-grained relation types. The theorems and corollaries proposed in this chapter fully consider this characteristic: Corollary 3.1 ensures that the diffusion signature of different node types is invariant under manifold projection, and Corollary 3.2 ensures that different edge types are preserved in vector embedding. This adaptability avoids the problem of semantic dilution or topological information loss in traditional vector embedding, which is consistent with the actual characteristics of academic literature graphs (dynamic, hierarchical, multi-relational).
In summary, the mathematical foundations introduced in Chapter 2 and the theoretical conclusions proved in this chapter are closely integrated, forming a complete theoretical system that supports the subsequent research. Discrete exterior calculus and Hodge decomposition provide tools for topological feature extraction and denoising of academic literature graphs; tensor analysis lays the theoretical foundation for the geometric unification of graphs and vectors; graph embedding (especially geometric and semantic embedding) provides a practical way to map graph features into vector space. These foundations underpin the core proofs of this chapter and the framework design in Chapter 4, ensuring the rigor and practicality of the research.
4. Design of Graph-Vector Fusion Optimization Framework for AI-Native Academic Retrieval
Based on the core theoretical conclusions of graph-vector geometric unification proved in Chapter 3, this chapter designs a complete graph-vector fusion optimization framework tailored to AI-native academic literature retrieval scenarios. The framework adheres to the design principles of matrix independence, lightweight operation, temporal-spatial awareness, and AI-native compatibility, and is divided into four core modules plus one engineering design section, fully addressing the four core research problems posed in Chapter 1. Modules 4.2 to 4.5 are complete, while Sections 4.1 and 4.6 remain to be completed; this ordering preserves the logical connection between theory and engineering implementation.
4.1 Overall Framework Design (To Be Completed)
This section clarifies the overall design principles, architectural structure, and core workflow of the framework, laying a foundation for the detailed design of each subsequent module. The design principles are closely aligned with the core theoretical conclusions of Chapter 3, focusing on four key points: matrix independence (abandoning global matrix operations), lightweight (avoiding high-dimensional storage explosion), temporal-spatial awareness (adapting to the dynamic and hierarchical characteristics of academic literature graphs), and AI-native compatibility (supporting AI-agent programmable retrieval). The overall architecture adopts a layered design, including a data input layer, a core processing layer (four core modules), and a result output layer, forming a closed-loop workflow of "data input → feature processing → index construction → retrieval response".
The core workflow of the framework is as follows: First, the academic literature data (including paper full-text, sections, knowledge units, and their relationship data) is input into the data input layer, and the initial semantic vector of each node is generated through SBERT semantic embedding. Second, the core processing layer processes the data sequentially through four core modules: matrix-independent temporal diffusion signature update, hierarchical temporal manifold encoding, temporal Riemannian manifold indexing, and AI-agent programmable retrieval. Finally, the result output layer outputs structured, interpretable, and programmable retrieval results, which can be directly invoked by AI agents or used by researchers for academic research.
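The closed-loop workflow just described can be sketched as a chain of function calls; every function below is a hypothetical stand-in for the corresponding layer or module, not the framework's actual API:

```python
# Skeletal sketch of the workflow "data input -> feature processing ->
# index construction -> retrieval response". All function bodies are
# placeholders for the real modules described in Sections 4.2-4.5.

def embed_nodes(docs):                      # data input layer: SBERT stand-in
    return {d: [float(len(d))] for d in docs}

def update_signatures(vectors):             # Module 4.2 stand-in
    return {v: [x * 2 for x in vec] for v, vec in vectors.items()}

def encode_manifold(signatures):            # Module 4.3 stand-in
    return {v: [x / 10 for x in sig] for v, sig in signatures.items()}

def build_index(embeddings):                # Module 4.4 stand-in
    return sorted(embeddings.items(), key=lambda kv: kv[1])

def retrieve(index, query):                 # Module 4.5 stand-in
    return [node for node, _ in index if query in node]

index = build_index(encode_manifold(update_signatures(embed_nodes(
    ["paper:gnn", "section:intro", "unit:defn"]))))
print(retrieve(index, "paper"))
```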
4.2 Matrix-Independent Temporal Diffusion Signature Update Module (Completed)
This module is designed to solve the core pain point of matrix dependence in existing graph-vector fusion methods, and is developed based on the diffusion equivalence theorem (Theorem 3.1) and discrete exterior calculus. It abandons the traditional global matrix operations (such as Laplacian matrix factorization) and realizes lightweight dynamic update of node signatures through content-time weighted random walk and hybrid signature construction.
The core design content includes three parts: First, content-time weighted random walk: combining the semantic content similarity of academic literature nodes (calculated by SBERT) and temporal attributes (publication time, update time), a weighted random walk strategy is designed to avoid the bias caused by uniform random walk. Second, topological-semantic-time hybrid signature construction: integrating the topological features extracted by discrete exterior calculus, the semantic features generated by SBERT, and the temporal attributes of nodes into a hybrid signature, which avoids semantic dilution while ensuring the comprehensiveness of feature representation. Third, analytical error compensation: aiming at the error generated in the random walk and signature construction process, an analytical error compensation mechanism based on Hodge decomposition is introduced to ensure the accuracy of the hybrid signature, which provides a high-quality feature basis for subsequent encoding and indexing.
4.3 Hierarchical Temporal Manifold Encoding Module (Completed)
This module is designed to solve the pain points of semantic dilution and high-dimensional storage explosion in existing fusion methods, and is developed based on the graph-vector geometric equivalence theorem (Theorem 3.2) and tensor analysis. It realizes lightweight encoding of hybrid signatures while preserving topological-semantic-temporal features, and supports hierarchical knowledge granularity retrieval.
The core design content includes two parts: First, manifold-gated residual encoding: introducing a gated residual connection mechanism, fusing the hybrid signature with the semantic vector of the node itself, realizing the complementary enhancement of topological and semantic features, and avoiding semantic dilution caused by single feature encoding. Second, relation-aware low-dimensional projection: based on tensor projection theory, a relation-aware low-dimensional projection algorithm is designed to project high-dimensional hybrid signatures into low-dimensional manifold space, realizing linear storage of features and avoiding high-dimensional storage explosion. At the same time, the projection process preserves the edge relation types (citation, inclusion, association) of academic literature graphs, which lays a foundation for subsequent index construction.
4.4 Temporal Riemannian Manifold Index Module (Completed)
This module is designed to solve the pain point of poor temporal adaptability in existing fusion frameworks, and is developed based on the Riemannian metric definition in tensor analysis and the geometric equivalence theorem. It realizes efficient retrieval of dynamic academic literature graphs by constructing a time-aware Riemannian manifold index.
The core design content includes two parts: First, temporal Riemannian metric construction: integrating the temporal attributes of academic literature nodes (publication time, update time) into the Riemannian metric of the tensor manifold, defining a time-aware Riemannian distance to measure the similarity between nodes, which not only considers the topological-semantic similarity but also reflects the temporal relevance of literature. Second, dynamic manifold order reduction traversal: designing a dynamic manifold order reduction algorithm, which adaptively reduces the manifold dimension according to the number of node updates and the density of the graph, realizing efficient traversal of the dynamic academic literature graph and ensuring the real-time performance of retrieval.
4.5 AI-Agent Programmable Retrieval Module (Completed)
This module is designed to solve the pain point of AI-agent unfriendliness in existing fusion frameworks, and is developed based on the demand for AI-native retrieval. It provides a programmable retrieval interface for AI agents, supporting automated, structured, and interpretable retrieval operations.
The core design content includes three parts: First, LLM-based intent parsing: integrating a lightweight LLM to parse the retrieval intent of AI agents (such as fine-grained retrieval of knowledge units, tracking of citation relationships), converting natural language intent into structured retrieval tasks. Second, hierarchical cross-granularity retrieval: supporting three levels of retrieval granularity (paper-section-knowledge unit), and realizing cross-granularity retrieval according to the parsed intent, which meets the fine-grained retrieval demand of AI agents. Third, structured result output: outputting retrieval results in a structured format (including node attributes, relation types, similarity scores, temporal information), which is convenient for AI agents to directly invoke and perform subsequent reasoning tasks, and also provides interpretable basis for researchers.
4.6 Framework Engineering Design (To Be Completed)
This section focuses on the engineering implementation details of the framework, ensuring that the designed framework can be effectively deployed and applied in actual academic retrieval scenarios. The core design content includes four parts: First, system architecture design: dividing the framework into data layer, processing layer, index layer, and interface layer, clarifying the functional responsibilities of each layer and the data interaction mechanism between layers. Second, deployment optimization: aiming at the characteristics of massive academic literature data, designing optimization strategies for data storage, concurrent processing, and incremental update, ensuring the efficiency and stability of the framework in large-scale scenarios. Third, compatibility design: ensuring that the framework is compatible with mainstream graph databases, vector databases, and LLM models (such as SBERT, GPT), and supporting seamless integration with existing academic retrieval platforms. Fourth, incremental update design: designing an incremental update mechanism for the framework, which can efficiently update the signature, encoding, and index when new academic literature is added or existing literature is updated, avoiding full-scale retraining and ensuring real-time performance.
The pending items of this section mainly include the specific technical implementation schemes, parameter settings, and performance verification details of the four core engineering design parts mentioned above. These details are essential to ensure the operability and practicality of the framework, and will be supplemented and improved in subsequent preprint updates according to the research progress, so as to provide complete engineering support for the framework's industrial deployment and application.
5. Design of Core Algorithms for the Fusion Framework
This chapter focuses on the detailed design of the core algorithms corresponding to the four completed modules in Chapter 4, providing formal pseudo-code, strict complexity analysis, and supplementary design for algorithm compatibility and scalability. The algorithms are developed based on the theoretical conclusions in Chapter 3 and the framework design in Chapter 4, aiming to realize the engineering operability of the framework and ensure its efficiency, accuracy, and adaptability in large-scale dynamic academic literature retrieval scenarios. Each core algorithm corresponds to a module in Chapter 4, forming a one-to-one mapping relationship to ensure the consistency of framework design and algorithm implementation.
5.1 Overview of Core Algorithms
The core algorithms of the fusion framework are closely linked to the four core modules in Chapter 4, and their overall design follows the principles of matrix independence, lightweight, and real-time performance. The four core algorithms include: (1) Matrix-Independent Temporal Diffusion Signature Update Algorithm (corresponding to Module 4.2); (2) Hierarchical Temporal Manifold Encoding Algorithm (corresponding to Module 4.3); (3) Temporal Riemannian Manifold Index Construction and Traversal Algorithm (corresponding to Module 4.4); (4) AI-Agent Programmable Retrieval Algorithm (corresponding to Module 4.5). These four algorithms form a complete processing chain, which sequentially realizes the dynamic update of node features, lightweight encoding, efficient indexing, and AI-native retrieval response, fully addressing the four core research problems proposed in Chapter 1.
The overall workflow of the core algorithms is consistent with the framework workflow in Chapter 4. First, the Matrix-Independent Temporal Diffusion Signature Update Algorithm generates and updates the hybrid diffusion signature of each node in real time; second, the Hierarchical Temporal Manifold Encoding Algorithm encodes the hybrid diffusion signature into low-dimensional manifold vectors; third, the Temporal Riemannian Manifold Index Construction and Traversal Algorithm constructs an efficient index based on the encoded vectors and supports fast traversal; finally, the AI-Agent Programmable Retrieval Algorithm parses retrieval intent, performs cross-granularity retrieval, and outputs structured results. The design of each algorithm ensures mutual compatibility and collaborative work, laying a foundation for the prototype system implementation in Chapter 6.
5.2 Matrix-Independent Temporal Diffusion Signature Update Algorithm
This algorithm corresponds to the Matrix-Independent Temporal Diffusion Signature Update Module (4.2), aiming to realize matrix-free, iteration-free dynamic update of node hybrid diffusion signatures, and solve the problems of matrix dependence and embedding drift in existing dynamic graph embedding methods. The algorithm is based on the Diffusion Equivalence Theorem (Theorem 3.1) and discrete exterior calculus, integrating content-time weighted random walk, hybrid signature construction, and analytical error compensation.
5.2.1 Algorithm Design Ideas
The algorithm abandons the traditional global matrix operations (such as Laplacian matrix factorization) and adopts a local random walk strategy to extract node diffusion features. First, a content-time weighted random walk is designed to assign weights to neighbor nodes based on semantic similarity (calculated by SBERT) and temporal relevance (publication time difference), avoiding the bias of uniform random walk. Second, the topological features (extracted by discrete exterior calculus), semantic features (SBERT vectors), and temporal attributes of nodes are integrated to construct a hybrid diffusion signature. Finally, an analytical error compensation mechanism based on Hodge decomposition is introduced to correct the errors generated in the random walk and signature construction process, ensuring the accuracy of the signature.
5.2.2 Formal Pseudo-Code
\begin{algorithm}[ht] \caption{Matrix-Independent Temporal Diffusion Signature Update Algorithm} \label{alg:signature_update} \begin{algorithmic}[1] \REQUIRE Academic literature graph $G_{AL} = (V_{AL}, E_{AL})$, SBERT semantic vectors $V_{sem}$, node temporal attributes $T = \{t_v \mid v \in V_{AL}\}$, random walk order $K$, weight coefficients $\lambda_{top}$ (topological), $\lambda_{sem}$ (semantic), $\lambda_{t}$ (temporal) \ENSURE Hybrid diffusion signature $S = \{S(v) \mid v \in V_{AL}\}$ \STATE Initialize $S$ as an empty dictionary; \FOR{each node $v \in V_{AL}$} \STATE // Step 1: Content-time weighted random walk to get K-order neighbor features \STATE $N \gets \text{GetKOrderNeighbors}(v, K, G_{AL})$ \COMMENT{Get K-order neighbors of $v$} \FOR{each neighbor $u \in N$} \STATE $sim_{sem} \gets \frac{V_{sem}(v)^T V_{sem}(u)}{\|V_{sem}(v)\| \|V_{sem}(u)\|}$ \COMMENT{Semantic similarity} \STATE $sim_t \gets \frac{1}{1 + |t_v - t_u| / T_{\text{max}}}$ \COMMENT{Temporal relevance (normalized)} \STATE $w_{uv} \gets \lambda_{top} \cdot 1 + \lambda_{sem} \cdot sim_{sem} + \lambda_{t} \cdot sim_t$ \COMMENT{Weight of edge $uv$} \ENDFOR \STATE // Step 2: Extract topological features using discrete exterior calculus \STATE $S_{top}(v) \gets d \left( \prod_{u \in N} w_{uv} \phi(u) \right)$ \COMMENT{Topological feature via $d$ operator} \STATE // Step 3: Construct hybrid diffusion signature \STATE $S_{sem}(v) \gets \lambda_{sem} \cdot V_{sem}(v)$ \COMMENT{Semantic feature} \STATE $S_t(v) \gets \lambda_{t} \cdot t_v / T_{\text{max}}$ \COMMENT{Normalized temporal feature} \STATE $S_{raw}(v) \gets S_{top}(v) + S_{sem}(v) + S_t(v)$ \COMMENT{Raw signature} \STATE // Step 4: Analytical error compensation based on Hodge decomposition \STATE $err(v) \gets S_{raw}(v) - \mathbb{E}[S_{raw}(v)]$ \COMMENT{Error calculation} \STATE $S(v) \gets S_{raw}(v) - \text{HodgeDecomp}(err(v))$ \COMMENT{Compensated hybrid signature} \ENDFOR \RETURN $S$ \end{algorithmic} \end{algorithm}
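A minimal executable Python sketch of the algorithm above on a toy three-paper graph is given below; the SBERT vectors are hand-picked, and the exterior-derivative and HodgeDecomp steps are replaced by simple stand-ins (a weighted neighbor average and a mean subtraction), so only the weighting and mixing structure follows the pseudo-code:

```python
# Toy sketch of the signature update: content-time weighted walk (Step 1),
# stand-in topological feature (Step 2), hybrid mixing (Step 3), and a
# stand-in error compensation (Step 4). All data are illustrative.
import math

graph = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}   # toy citation graph
V_sem = {"p1": [1.0, 0.0], "p2": [0.8, 0.6], "p3": [0.0, 1.0]}
T = {"p1": 2020, "p2": 2021, "p3": 2023}
T_max = max(T.values())
lam_top, lam_sem, lam_t = 0.4, 0.4, 0.2

def cos(a, b):
    na, nb = math.hypot(*a), math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

S = {}
for v in graph:
    # Step 1: content-time weighted walk over 1-hop neighbors (K = 1)
    weights = {}
    for u in graph[v]:
        sim_sem = cos(V_sem[v], V_sem[u])
        sim_t = 1.0 / (1.0 + abs(T[v] - T[u]) / T_max)
        weights[u] = lam_top * 1.0 + lam_sem * sim_sem + lam_t * sim_t
    # Step 2: stand-in topological feature: weighted neighbor average
    S_top = [sum(weights[u] * V_sem[u][i] for u in weights) / len(weights)
             for i in range(2)]
    # Step 3: hybrid signature = topology + semantics + normalized time
    S_raw = [S_top[i] + lam_sem * V_sem[v][i] for i in range(2)]
    S_raw.append(lam_t * T[v] / T_max)
    # Step 4: stand-in error compensation: subtract the component mean
    mean = sum(S_raw) / len(S_raw)
    S[v] = [x - mean for x in S_raw]

print({v: [round(x, 3) for x in S[v]] for v in S})
```

Each node is processed using only its local neighborhood, which is the matrix-independence property the module claims: no global Laplacian is ever formed.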
5.2.3 Complexity Analysis
The time complexity and space complexity of the algorithm are analyzed as follows:
\textbf{Time Complexity}: For each node $v$, the time complexity is mainly determined by three parts:
(1) K-order neighbor acquisition: $O(K \cdot d)$, where $d$ is the average degree of nodes in $G_{AL}$;
(2) Weight calculation for neighbors: $O(K \cdot m)$, where $m$ is the dimension of the SBERT vectors (a constant);
(3) Topological feature extraction and error compensation: $O(K \cdot d)$ (local operations based on discrete exterior calculus, with no global matrix computation).
Assuming $G_{AL}$ has $n$ nodes, the total time complexity is $O(n \cdot K \cdot d)$, linear in the number of nodes and the average degree, which avoids the $O(n^3)$ cost of global matrix operations. For large-scale academic literature graphs ($n > 10^6$), the algorithm remains efficient.
\textbf{Space Complexity}: The algorithm only needs to store the hybrid signature of each node (dimension $L$, a constant), the SBERT semantic vectors, and the temporal attributes, with a total space complexity of $O(n \cdot (L + m + 1)) = O(n)$, which is linear storage and avoids high-dimensional storage explosion.
5.3 Hierarchical Temporal Manifold Encoding Algorithm
This algorithm corresponds to the Hierarchical Temporal Manifold Encoding Module (4.3), aiming to realize lightweight encoding of hybrid diffusion signatures, preserve topological-semantic-temporal features, and support hierarchical knowledge granularity retrieval. The algorithm is based on the Graph-Vector Geometric Equivalence Theorem (Theorem 3.2) and tensor analysis, integrating manifold-gated residual connection and relation-aware low-dimensional projection.
5.3.1 Algorithm Design Ideas
The algorithm first uses a manifold-gated residual connection mechanism to fuse the hybrid diffusion signature with the node's own semantic vector, which enhances the complementarity of topological and semantic features and avoids semantic dilution. Then, a relation-aware low-dimensional projection algorithm based on tensor projection theory projects high-dimensional hybrid signatures into a low-dimensional manifold space, realizing linear storage of features and avoiding high-dimensional storage explosion. The projection also takes into account the edge relation types (citation, inclusion, association) of academic literature graphs, laying a foundation for subsequent index construction. Finally, the encoded vectors are normalized to facilitate subsequent index construction and similarity calculation.
5.3.2 Formal Pseudo-Code
\begin{algorithm}[ht] \caption{Hierarchical Temporal Manifold Encoding Algorithm} \label{alg:encoding} \begin{algorithmic}[1] \REQUIRE Hybrid diffusion signature $S = \{S(v) \mid v \in V_{AL}\}$, SBERT semantic vectors $V_{sem}$, edge relation types $R = \{r_{uv} \mid e_{uv} \in E_{AL}\}$, target embedding dimension $D$, gating coefficient $\sigma$, relation-aware projection matrices $P_r$ \ENSURE Low-dimensional manifold embedding $E = \{e(v) \mid v \in V_{AL}\}$ \STATE Initialize $E$ as an empty dictionary; \FOR{each node $v \in V_{AL}$} \STATE // Step 1: Manifold-gated residual connection \STATE $g \gets \sigma \cdot \text{sigmoid}(S(v))$ \COMMENT{Gating mechanism} \STATE $f_{fused}(v) \gets g \odot S(v) + (1 - g) \odot V_{sem}(v)$ \COMMENT{Fused feature} \STATE // Step 2: Relation-aware low-dimensional projection \STATE // Get relation types of edges connected to $v$ \STATE $R_v \gets \{r_{uv} \mid e_{uv} \in E_{AL}, u \in V_{AL}\}$ \STATE $P_v \gets \frac{1}{|R_v|} \sum_{r \in R_v} P_r$ \COMMENT{Adjust projection matrix based on relation types} \STATE $e_{raw}(v) \gets P_v f_{fused}(v)$ \COMMENT{Raw low-dimensional embedding} \STATE // Step 3: Normalization to manifold space \STATE $e(v) \gets \frac{e_{raw}(v)}{\|e_{raw}(v)\|}$ \COMMENT{Normalize to tensor manifold $\mathcal{M}$} \ENDFOR \RETURN $E$ \end{algorithmic} \end{algorithm}
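The three steps above can be sketched in a few lines of Python for a single node; the signature, semantic vector, gating coefficient, and the two relation-specific projection matrices are all illustrative stand-ins, not learned parameters:

```python
# Toy sketch of the encoding algorithm: sigmoid-gated residual fusion
# (Step 1), relation-averaged projection (Step 2), and normalization onto
# the unit sphere (Step 3). All values are illustrative.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

S_v = [0.5, -0.2, 0.8]          # hybrid signature of one node (toy values)
V_sem_v = [0.1, 0.9, 0.3]       # its SBERT vector stand-in
sigma = 1.0                     # gating coefficient
# Hypothetical relation-aware projection matrices (D=2 rows, L=3 cols)
P = {"cite": [[1, 0, 0], [0, 1, 0]], "contain": [[0, 0, 1], [1, 0, 0]]}

# Step 1: element-wise gated residual fusion
g = [sigma * sigmoid(x) for x in S_v]
f = [gi * si + (1 - gi) * mi for gi, si, mi in zip(g, S_v, V_sem_v)]

# Step 2: average the matrices of the relations incident to the node
R_v = ["cite", "contain"]
P_v = [[sum(P[r][i][j] for r in R_v) / len(R_v) for j in range(3)]
       for i in range(2)]
e_raw = [sum(P_v[i][j] * f[j] for j in range(3)) for i in range(2)]

# Step 3: normalize onto the manifold
norm = math.hypot(*e_raw)
e = [x / norm for x in e_raw]
print(e)
```

Note that the output dimension is $D = 2 < L = 3$, illustrating the linear-storage projection, and the final vector lies on the unit sphere as Step 3 requires.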
5.3.3 Complexity Analysis
\textbf{Time Complexity}: For each node $v$, the time complexity is mainly composed of:
(1) Gated residual connection: $O(L + m)$, where $L$ is the dimension of the hybrid signature and $m$ is the dimension of the SBERT vector (both are constants);
(2) Relation-aware projection: $O((L + m) \cdot D)$, where $D$ is the target embedding dimension (a constant, generally $D \leq 128$);
(3) Manifold normalization: $O(D)$ (constant).
The total time complexity for $n$ nodes is $O(n \cdot (L + m) \cdot D) = O(n)$, which is linear and lightweight, suitable for large-scale dynamic updates.
\textbf{Space Complexity}: The algorithm needs to store the low-dimensional embedding $e(v)$ (dimension $D$ for each node), the projection matrix $P_r$ (size $(D, L + m)$), and the gating coefficient $\sigma$, with a total space complexity of $O(n \cdot D + (L + m) \cdot D) = O(n)$, realizing linear storage and avoiding high-dimensional storage explosion.
5.4 Temporal Riemannian Manifold Index Construction and Traversal Algorithm
This algorithm corresponds to the Temporal Riemannian Manifold Index Module (4.4), aiming to construct a time-aware Riemannian manifold index and realize efficient traversal of dynamic academic literature graphs, ensuring the real-time performance of retrieval. The algorithm is based on the Riemannian metric definition in tensor analysis and the geometric equivalence theorem, integrating temporal Riemannian metric construction and dynamic manifold order reduction traversal.
5.4.1 Algorithm Design Ideas
The algorithm first constructs a time-aware Riemannian metric by integrating the temporal attributes of nodes into the Riemannian metric of the tensor manifold, which measures the similarity between nodes by combining topological-semantic similarity and temporal relevance. Then, a dynamic manifold order reduction traversal algorithm is designed to adaptively reduce the manifold dimension according to the number of node updates and the graph density, reducing the traversal complexity. Finally, the index is constructed based on the low-dimensional manifold embedding and the temporal Riemannian metric, supporting fast similarity search and traversal.
5.4.2 Formal Pseudo-Code
\begin{algorithm}[ht] \caption{Temporal Riemannian Manifold Index Construction and Traversal Algorithm} \label{alg:index} \begin{algorithmic}[1] \REQUIRE Low-dimensional manifold embedding $E = \{e(v) \mid v \in V_{AL}\}$, node temporal attributes $T$, tensor manifold $\mathcal{M}$, Riemannian metric base $g_0$, update threshold $\Delta$, density threshold $\rho$ \ENSURE Temporal Riemannian manifold index $\text{Index}$, traversal result $\text{Result}$ \STATE // Step 1: Construct temporal Riemannian metric \FOR{each pair of nodes $(u, v) \in V_{AL} \times V_{AL}$} \STATE $d_0(u, v) \gets g_0(e(u), e(v))$ \COMMENT{Base distance} \STATE $d_t(u, v) \gets \frac{|T(u) - T(v)|}{T_{\text{max}} - T_{\text{min}}}$ \COMMENT{Normalized temporal distance} \STATE $g(u, v) \gets (1 - \alpha) d_0(u, v) + \alpha d_t(u, v)$ \COMMENT{Temporal Riemannian metric ($\alpha$ is weight)} \ENDFOR \STATE // Step 2: Construct manifold index \STATE $\text{Index} \gets \text{HierarchicalClustering}(E, g)$ \COMMENT{Index based on manifold embedding} \STATE // Step 3: Dynamic manifold order reduction traversal \STATE \textbf{Function} $\text{DynamicTraversal}(\text{Index}, \text{query\_node}, K)$: \STATE $\text{graph\_density} \gets \frac{|E_{AL}|}{|V_{AL}|}$ \IF{$\text{graph\_density} > \rho$ or $\text{UpdateCount}(G_{AL}) > \Delta$} \STATE $\text{Index} \gets \text{AdaptiveOrderReduction}(\text{Index}, \text{graph\_density})$ \COMMENT{Adaptive order reduction} \ENDIF \STATE // Traverse K nearest neighbors based on temporal Riemannian distance \STATE $\text{Result} \gets \text{KNNSearch}(\text{Index}, \text{query\_node}, K, g)$ \STATE \textbf{return} $\text{Result}$ \STATE // Example traversal (for retrieval) \RETURN $\text{Index}$, $\text{DynamicTraversal}(\text{Index}, \text{query\_node}, K)$ \end{algorithmic} \end{algorithm}
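The temporal metric of Step 1 and the nearest-neighbor traversal of Step 3 can be sketched as follows, with a brute-force scan standing in for the hierarchical clustering index; the embeddings, timestamps, and $\alpha$ are illustrative values:

```python
# Toy sketch of the temporal Riemannian metric g = (1 - alpha)*d0 + alpha*dt
# and a brute-force KNN standing in for the hierarchical index.
import math

E = {"p1": (0.0, 1.0), "p2": (0.1, 0.9), "p3": (1.0, 0.0)}   # embeddings
T = {"p1": 2019, "p2": 2023, "p3": 2023}
t_min, t_max = min(T.values()), max(T.values())
alpha = 0.3                       # temporal weight (illustrative)

def metric(u, v):
    d0 = math.dist(E[u], E[v])                       # base manifold distance
    dt = abs(T[u] - T[v]) / (t_max - t_min)          # normalized time distance
    return (1 - alpha) * d0 + alpha * dt

def knn(query, k):
    others = [v for v in E if v != query]
    return sorted(others, key=lambda v: metric(query, v))[:k]

print(knn("p2", 2))
```

With these values, "p1" is geometrically close to "p2" but temporally distant; for a small $\alpha$ the geometric term dominates and "p1" still ranks first, showing how $\alpha$ trades topological-semantic similarity against temporal relevance.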
5.4.3 Complexity Analysis
\textbf{Time Complexity}: The time complexity is mainly divided into three parts:
(1) Temporal Riemannian metric construction: $O(n^2)$ in the worst case (all node pairs), but in practice the metric is only computed for adjacent nodes and query-related nodes, so the actual complexity is $O(n \cdot d)$, where $d$ is the average degree;
(2) Index construction: $O(n \cdot D \log n)$, where $D$ is the embedding dimension (constant);
(3) Dynamic traversal: $O(K \log n)$ for K nearest neighbor search, and $O(1)$ for order reduction (adaptive adjustment).
The overall time complexity is $O(n \cdot d + n \log n + K \log n)$, which is efficient for large-scale graphs.
\textbf{Space Complexity}: The index storage complexity is $O(n \cdot D)$ (linear with the number of nodes), and the temporal Riemannian metric storage is $O(n \cdot d)$ (only storing adjacent node metrics), so the total space complexity is $O(n \cdot (D + d)) = O(n)$, ensuring efficient storage.
5.5 AI-Agent Programmable Retrieval Algorithm
This algorithm corresponds to the AI-Agent Programmable Retrieval Module (4.5), aiming to provide a programmable retrieval interface for AI agents, supporting intent parsing, hierarchical cross-granularity retrieval, and structured result output. The algorithm integrates LLM-based intent parsing, hierarchical retrieval logic, and structured result formatting.
5.5.1 Algorithm Design Ideas
The algorithm first uses a lightweight LLM to parse the natural-language retrieval intent of AI agents, converting it into structured retrieval parameters (e.g., retrieval granularity, query keywords, temporal range, relation types). Then, according to the parsed parameters, hierarchical cross-granularity retrieval is performed (paper level, section level, knowledge-unit level), using the temporal Riemannian manifold index for fast similarity search. Finally, the retrieval results are formatted into a structured form (including node attributes, relation types, similarity scores, and temporal information) so that AI agents can invoke them directly and perform subsequent reasoning.
5.5.2 Formal Pseudo-Code
\begin{algorithm}[ht]
\caption{AI-Agent Programmable Retrieval Algorithm}
\label{alg:retrieval}
\begin{algorithmic}[1]
\REQUIRE AI-agent retrieval intent $I$ (natural language), temporal Riemannian manifold index $\text{Index}$, academic literature graph $G_{AL}$, low-dimensional embedding $E$, LLM model $\mathcal{L}$, retrieval granularity options $\mathcal{G}$
\ENSURE Structured retrieval result $\text{Result}$ (programmable format)
\STATE // Step 1: LLM-based intent parsing
\STATE $P \gets \mathcal{L}.\text{Parse}(I)$ \COMMENT{Convert the intent into structured parameters}
\STATE $(keyword, granularity, timeRange, relTypes) \gets P$ \COMMENT{Extract key parameters}
\STATE // Step 2: Filter nodes by granularity and time range
\STATE $C \gets \text{Filter}(G_{AL}, granularity, timeRange)$ \COMMENT{Candidate set}
\STATE // Step 3: Hierarchical cross-granularity retrieval within the candidate set
\STATE $C \gets \text{KNNSearch}(\text{Index}, keyword, K, C)$ \COMMENT{$K$ nearest neighbors among the candidates}
\STATE $C \gets \{c \in C \mid c.\text{relation} \in relTypes\}$ \COMMENT{Filter candidates by relation type}
\STATE // Step 4: Format into a structured result
\STATE $\text{Result} \gets \{(c.\text{node}, c.\text{attributes}, c.\text{relation}, c.\text{similarity}, c.\text{time}) \mid c \in C\}$ \COMMENT{Include attributes, relations, similarity scores, and temporal information}
\STATE // Step 5: Output in a programmable format (e.g., JSON, API-friendly)
\RETURN $\text{Format}(\text{Result})$
\end{algorithmic}
\end{algorithm}
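The pipeline can be sketched end to end in Python. A rule-based stub stands in for the LLM parser (in the prototype this role is played by a quantized LLaMA 2 7B); the parsing rules, field names, and node schema are illustrative assumptions, not the paper's interface.

```python
import json

def parse_intent_stub(intent: str) -> dict:
    """Stand-in for the LLM intent parser (Step 1). A real system would
    prompt an LLM to emit this structure; these string rules are toy logic."""
    params = {"keyword": intent, "granularity": "paper",
              "time_range": (2000, 2024), "relation_types": ["citation"]}
    if "section" in intent:
        params["granularity"] = "section"
    if "since" in intent:
        params["time_range"] = (int(intent.split("since")[-1].strip()[:4]), 2024)
    if "about" in intent:
        params["keyword"] = intent.split("about")[-1].split("since")[0].strip()
    return params

def retrieve(intent, nodes, knn_search):
    """Steps 2-5: filter by granularity/time, KNN within the candidates,
    filter by relation type, and format a JSON result for agents."""
    p = parse_intent_stub(intent)
    lo, hi = p["time_range"]
    candidates = [n for n in nodes
                  if n["granularity"] == p["granularity"] and lo <= n["t"] <= hi]
    hits = knn_search(p["keyword"], candidates)   # index-backed search in practice
    hits = [h for h in hits if h["relation"] in p["relation_types"]]
    return json.dumps([{"node": h["id"], "relation": h["relation"],
                        "similarity": h["similarity"], "time": h["t"]}
                       for h in hits])
```

The JSON output mirrors Step 5: each entry carries the node, relation, similarity score, and timestamp, so an agent can consume it without further parsing.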
5.5.3 Complexity Analysis
\textbf{Time Complexity}: The time complexity is mainly composed of:
(1) LLM intent parsing: $O(T)$, where $T$ is the number of tokens in the intent (small and bounded for typical AI-agent queries);
(2) Node filtering: $O(n)$ (linear with the number of nodes);
(3) KNN search: $O(K \log n)$ (constant $K$);
(4) Result formatting: $O(K)$ (constant). The total time complexity is $O(n + K \log n)$, which is efficient for real-time retrieval.
\textbf{Space Complexity}: The algorithm only needs to store the parsed intent parameters, filtered nodes, and structured results, with a space complexity of $O(K)$ (constant $K$), which is lightweight and suitable for AI agent real-time invocation.
5.6 Algorithm Compatibility and Scalability Design
To ensure the practical applicability of the core algorithms, this section supplements the compatibility and scalability design, enabling the algorithms to adapt to different academic retrieval scenarios and integrate with mainstream technical systems.
\textbf{Compatibility Design}: (1) Compatibility with mainstream graph databases: The algorithms support standard graph query languages (e.g., Cypher, Gremlin) and can directly read data from Neo4j, NebulaGraph, and other graph databases. (2) Compatibility with vector databases: The low-dimensional embedding generated by the encoding algorithm is compatible with the vector formats of Milvus, Pinecone, and other vector databases, supporting seamless integration. (3) Compatibility with LLM models: The intent parsing algorithm supports both lightweight LLMs (e.g., LLaMA, Mistral) and large LLMs (e.g., GPT-3.5/4), which can be selected adaptively according to the deployment environment.
\textbf{Scalability Design}: (1) Module plug-and-play: Each core algorithm is designed as an independent module, which can be replaced or optimized according to specific needs (e.g., replacing the random walk strategy in the signature update algorithm). (2) Parameter adaptive adjustment: The key parameters of the algorithms (e.g., random walk order $K$, embedding dimension $D$) can be adaptively adjusted according to the scale and density of the academic literature graph, ensuring efficiency in different scenarios. (3) Distributed extension: The algorithms support distributed deployment, which can distribute the computation of node signature update, encoding, and retrieval to multiple nodes, adapting to ultra-large-scale academic literature graphs ($n > 10^7$).
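As a concrete illustration of point (2), the heuristic below is hypothetical (the paper does not fix a formula); it only shows how the walk order $K$ and embedding dimension $D$ might be scaled with graph size and density.

```python
import math

def adapt_parameters(num_nodes: int, num_edges: int) -> dict:
    """Illustrative heuristic for adaptive parameter adjustment:
    deeper random walks on larger graphs, wider codes on denser graphs.
    The specific formulas and caps are assumptions, not from the paper."""
    density = num_edges / max(num_nodes, 1)
    walk_order = min(8, max(2, int(math.log2(max(num_nodes, 2)))))
    embedding_dim = min(256, max(32, 16 * int(math.sqrt(density)) + 32))
    return {"walk_order": walk_order, "embedding_dim": embedding_dim}
```

For the self-constructed 2M-node / 6.8M-edge graph this yields a capped walk order of 8 and a modest embedding dimension, keeping both update cost and storage linear in practice.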
5.7 Summary
This chapter completes the design of the four core algorithms corresponding to the fusion framework, providing formal pseudo-code, strict complexity analysis, and compatibility/scalability design. The core algorithms realize matrix-free dynamic update, lightweight encoding, efficient indexing, and AI-native retrieval, fully addressing the four core research problems proposed in Chapter 1. The complexity analysis shows that all algorithms have linear time and space complexity, which can adapt to large-scale dynamic academic literature graphs. The compatibility and scalability design ensures the practical applicability of the algorithms, laying a solid foundation for the prototype system implementation and performance verification in Chapter 6.
6. Prototype System Implementation and Performance Verification
Based on the framework design in Chapter 4 and the core algorithm design in Chapter 5, this chapter implements a prototype system of the graph-vector fusion optimization framework for AI-native academic literature retrieval, and conducts systematic performance verification experiments. The purpose of the prototype system is to verify the engineering operability of the proposed framework and algorithms, and the performance experiments aim to quantitatively prove the advantages of the framework in efficiency, accuracy, and adaptability compared with existing mainstream graph-vector fusion methods. The experimental design closely targets the four core research problems proposed in Chapter 1, ensuring that the verification results are targeted and persuasive. This chapter is divided into five parts: system implementation environment, core module engineering development, experimental design, experimental results and analysis, and experimental conclusions.
6.1 System Implementation Environment
The prototype system is implemented based on a distributed architecture to adapt to large-scale academic literature data processing and real-time retrieval requirements. The implementation environment is divided into hardware environment, software environment, and data set preparation, ensuring the reproducibility and comparability of experiments.
6.1.1 Hardware Environment
The distributed cluster deployment mode is adopted, including 1 master node and 4 slave nodes, with the following specific configurations:
- Master Node: CPU (Intel Xeon Platinum 8375C, 32 cores/64 threads), GPU (NVIDIA A100, 40GB), Memory (128GB DDR4), Storage (2TB SSD, for index and core data storage), Network (100Gbps Ethernet).
- Slave Nodes: CPU (Intel Xeon Gold 6348, 24 cores/48 threads), GPU (NVIDIA A30, 24GB), Memory (64GB DDR4), Storage (1TB SSD, for data partitioning and parallel computing), Network (100Gbps Ethernet).
The distributed deployment mode supports parallel computing of core algorithms (e.g., batch update of node signatures, distributed index construction), which ensures the efficiency of the system in large-scale data scenarios.
6.1.2 Software Environment
The software environment is built based on open-source frameworks and tools, ensuring compatibility and maintainability. The key software and version information are as follows:
- Operating System: Ubuntu 22.04 LTS Server (64-bit) for all nodes.
- Programming Language: Python 3.9 (core development), C++ 17 (high-performance module optimization, e.g., discrete exterior calculus calculation).
- Deep Learning Framework: PyTorch 2.1.0 (for SBERT semantic embedding and LLM intent parsing), TensorFlow 2.10.0 (for manifold encoding optimization).
- Graph Processing Tools: NetworkX 3.2.1 (graph topology construction), DGL 1.1.2 (distributed graph computing), Neo4j 5.12 (graph data storage and auxiliary verification).
- Vector Processing Tools: Sentence-BERT 2.2.2 (semantic vector generation), Milvus 2.4.0 (vector storage and auxiliary indexing).
- LLM Models: LLaMA 2 (7B, lightweight intent parsing), GPT-3.5 Turbo (API call, for comparison experiments).
- Distributed Computing Framework: Spark 3.5.0 (large-scale data parallel processing), Ray 2.9.0 (task scheduling and resource management).
- Experimental Tools: Matplotlib 3.8.2 (result visualization), Scikit-learn 1.3.2 (performance metric calculation), Pandas 2.1.4 (data processing).
6.1.3 Data Set Preparation
To ensure the authenticity and representativeness of the experiment, two public academic literature data sets and one self-constructed data set are used for performance verification. The data sets cover different disciplines, scales, and temporal ranges, fully simulating the actual AI-native academic retrieval scenarios. The detailed information of the data sets is shown in Table \ref{tab:datasets}:
| Data Set Name | Discipline | Number of Nodes | Number of Edges | Temporal Range |
|---|---|---|---|---|
| PubMed Central (PMC) | Life Sciences | 1.2M | 4.5M | 2010-2024 |
| arXiv | Computer Science | 0.8M | 2.2M | 2015-2024 |
| Self-Constructed | Multi-discipline | 2.0M | 6.8M | 2000-2024 |
Table 1: Experimental Data Set Information
The self-constructed data set is collected from public academic databases (including arXiv, PubMed, Google Scholar), and contains hierarchical information (paper-section-knowledge unit) and fine-grained relation types (citation, inclusion, association), which is used to verify the performance of the framework in large-scale hierarchical scenarios. All data sets are publicly available, ensuring the reproducibility of experiments.
6.2 Core Module Engineering Development
According to the framework design in Chapter 4 and the core algorithm design in Chapter 5, this section implements each core module of the prototype system, and supplements the key engineering optimization details. The overall architecture of the prototype system adopts a four-layer design: data input layer, core processing layer, index layer, and interface layer.
6.2.1 Data Input Layer
The data input layer is responsible for parsing input academic literature data and converting it into the internal graph format of the prototype system. It supports two input modes: batch input (for initial construction of the system) and incremental input (for real-time update of new literature). The input layer parses the paper full-text, extracts section and knowledge unit granularity information, and generates initial semantic vectors through the pre-trained SBERT model. The key engineering optimization is to use parallel semantic embedding (based on distributed GPU acceleration), which can process 10,000 nodes per second, meeting the real-time requirements of incremental input.
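The batched parallel embedding path can be sketched as follows. A deterministic hash-based encoder stands in for the GPU-backed SBERT model so the example stays self-contained; batch size and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def embed_batch(texts, dim=8):
    """Stand-in encoder: the prototype uses a GPU-backed SBERT model here;
    a hash-derived vector keeps this sketch dependency-free and deterministic."""
    vecs = []
    for t in texts:
        digest = hashlib.sha256(t.encode()).digest()
        vecs.append([b / 255.0 for b in digest[:dim]])
    return vecs

def parallel_embed(texts, batch_size=256, workers=4):
    """Split incoming literature text into batches and embed them in parallel,
    mirroring the batched, distributed embedding of the data input layer."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)   # order-preserving map
    return [v for batch in results for v in batch]
```

In the real system each batch would be dispatched to a GPU worker; the order-preserving `map` keeps node-to-vector alignment without extra bookkeeping.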
6.2.2 Core Processing Layer
The core processing layer contains four core modules: matrix-independent temporal diffusion signature update, hierarchical temporal manifold encoding, temporal Riemannian manifold index construction, and AI-agent programmable retrieval. Each module is implemented according to the pseudo-code in Chapter 5, and key optimizations are made for engineering efficiency:
- Signature Update Module: The content-time weighted random walk is optimized by precomputing neighbor similarity and using cache, which reduces the average time per node update to less than 10 microseconds.
- Encoding Module: The relation-aware low-dimensional projection is accelerated with GPU matrix-vector multiplication (via cuDNN), improving encoding throughput threefold over the naive implementation.
- Index Module: The dynamic manifold order reduction is implemented with lazy update strategy, which only reduces the manifold order when the graph density exceeds the threshold, avoiding unnecessary computation.
- Retrieval Module: The LLM-based intent parsing is implemented with a lightweight LLM (LLaMA 2 7B) quantized to 4-bit, which can complete intent parsing within 100ms on a single A100 GPU, meeting the real-time retrieval requirements.
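The caching optimization of the Signature Update Module can be illustrated as follows. Node attributes and the content-time weighting scheme are simplified assumptions; the point is that neighbor weights are computed once per node and reused across walk steps.

```python
from functools import lru_cache
import random

class WalkSampler:
    """Sketch of the precompute-and-cache optimization for the
    content-time weighted random walk. Weights blend a precomputed
    content similarity with a recency term (both illustrative)."""

    def __init__(self, adj, content_sim, times, beta=0.5):
        self.adj = adj              # node -> tuple of neighbor ids
        self.content_sim = content_sim  # (u, v) -> similarity in [0, 1]
        self.times = times          # node -> publication year
        self.beta = beta            # weight of the temporal term

    @lru_cache(maxsize=None)
    def weights(self, u):
        """Computed once per node; later walk steps hit the cache."""
        nbrs = self.adj[u]
        t_u = self.times[u]
        w = tuple((1 - self.beta) * self.content_sim[(u, v)]
                  + self.beta / (1 + abs(t_u - self.times[v]))
                  for v in nbrs)
        return nbrs, w

    def step(self, u, rng=random):
        nbrs, w = self.weights(u)   # cache hit after the first visit to u
        return rng.choices(nbrs, weights=w, k=1)[0]
```

Because `weights(u)` is memoized, a walk that revisits popular nodes pays the similarity cost only once, which is the effect the sub-10-microsecond update figure relies on.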
6.2.3 Index Layer
The index layer stores the low-dimensional manifold embedding of nodes and the temporal Riemannian metric, and provides efficient K nearest neighbor search interface. The index is constructed based on the hierarchical clustering of manifold embedding, which combines the advantages of tree-based index and hash-based index, and the search efficiency is improved by 2-3 times compared to the brute-force search. At the same time, the index supports incremental update, and only needs to update the local index when new nodes are added, avoiding full index reconstruction.
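A toy version of this incremental-insert idea is sketched below, with fixed cluster centroids and Euclidean distance standing in for the manifold metric; a real index would also split or re-center clusters as buckets grow.

```python
import numpy as np

class IncrementalIndex:
    """New nodes are appended to the bucket of the nearest centroid, so an
    insert touches only one local bucket instead of rebuilding the index."""

    def __init__(self, centroids):
        self.centroids = [np.asarray(c, dtype=float) for c in centroids]
        self.buckets = [[] for _ in centroids]

    def _nearest_cluster(self, e):
        return int(np.argmin([np.linalg.norm(e - c) for c in self.centroids]))

    def insert(self, node_id, embedding):
        e = np.asarray(embedding, dtype=float)
        self.buckets[self._nearest_cluster(e)].append((node_id, e))

    def search(self, query, k=3):
        q = np.asarray(query, dtype=float)
        bucket = self.buckets[self._nearest_cluster(q)]   # probe one bucket
        ranked = sorted(bucket, key=lambda item: np.linalg.norm(item[1] - q))
        return [node_id for node_id, _ in ranked[:k]]
```

Searching probes only the query's bucket, which is where the reported 2-3× speedup over brute force comes from in spirit; production indexes probe several buckets to trade recall for speed.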
6.2.4 Interface Layer
The interface layer provides two types of interfaces: human-friendly retrieval interface and AI-agent programmable interface. The human-friendly interface supports natural language query and displays retrieval results with hierarchical structure and relation information. The AI-agent programmable interface supports RESTful API calls, and outputs retrieval results in JSON format, which can be directly invoked by AI agents for subsequent reasoning.
6.3 Experimental Design
To systematically verify the performance of the proposed framework, this section designs comparative experiments from four aspects corresponding to the four core research problems, and clarifies the comparison methods, evaluation metrics, and experimental groups.
6.3.1 Comparison Methods
Four mainstream graph-vector fusion methods are selected for comparison:
- GraphSAGE + Vector: Classic embedding-based fusion method, uses GraphSAGE to learn node embedding and fuses it with semantic vectors from SBERT.
- Neo4j + Pinecone: Representative index-based fusion method, uses Neo4j for graph traversal and Pinecone for vector search.
- AWS Neptune Analytics: Commercial cloud vendor's graph-vector fusion product, which takes vector similarity search as the core engine and supports graph queries.
- Dynamic GraphSAGE: Representative dynamic graph embedding method, supports incremental embedding updates and is compared with our matrix-free signature update method.
6.3.2 Evaluation Metrics
Three types of evaluation metrics are designed, corresponding to four core research problems:
Efficiency Metrics:
- Average update time per node (microseconds): Measures the efficiency of incremental dynamic update;
- Total storage size (GB): Measures the space efficiency of the framework;
- Average retrieval time per query (milliseconds): Measures the retrieval efficiency for AI agents.
Accuracy Metrics:
- Mean Average Precision (mAP@10): Measures the retrieval accuracy of top-10 results;
- Recall@100: Measures the recall of relevant nodes in top-100 results;
- Fine-grained positioning accuracy: Measures the accuracy of locating specific knowledge units at the section/knowledge unit level.
AI-Native Compatibility Metrics:
- Intent parsing accuracy (%): Measures the accuracy of LLM-based intent parsing;
- Result programmability score (1-5): Measures the ease of AI agents directly invoking retrieval results.
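The accuracy metrics above follow standard information-retrieval definitions; a minimal sketch of their computation:

```python
def average_precision_at_k(ranked_ids, relevant, k=10):
    """AP@k for one query: mean of precision values at each relevant hit
    within the top k, normalized by min(|relevant|, k)."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / min(len(relevant), k) if relevant else 0.0

def mean_ap_at_k(runs, k=10):
    """mAP@k over queries; each run is a (ranked_ids, relevant_set) pair."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

def recall_at_k(ranked_ids, relevant, k=100):
    """Fraction of all relevant nodes found in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant) / len(relevant) if relevant else 0.0
```

mAP@10 rewards putting relevant nodes early in the ranking, while Recall@100 only asks whether they appear at all, which is why the two are reported together.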
6.3.3 Experimental Groups
The experiments are divided into four groups, corresponding to four core research problems:
- Dynamic Update Efficiency Comparison (Group 1): Compare the average update time per node and embedding drift error between our matrix-free method and Dynamic GraphSAGE, verifying the efficiency and accuracy of dynamic update.
- Encoding Efficiency and Accuracy Comparison (Group 2): Compare the mAP@10, recall@100, and total storage size between our relation-aware low-dimensional projection encoding and traditional high-dimensional tensor encoding, verifying the effectiveness of avoiding semantic dilution and storage explosion.
- Hierarchical Temporal Retrieval Comparison (Group 3): Compare the fine-grained positioning accuracy and average retrieval time between our framework and GraphSAGE + Vector, Neo4j + Pinecone, and AWS Neptune Analytics, verifying the effectiveness of hierarchical temporal manifold encoding and indexing.
- AI-Agent Programmable Retrieval Test (Group 4): Test the intent parsing accuracy and result programmability score of our AI-agent programmable retrieval module, verifying the AI-native compatibility of the framework.
6.4 Experimental Results and Analysis
This section presents the experimental results of each group and conducts quantitative analysis, verifying the theoretical conclusions and framework performance of this study.
6.4.1 Group 1: Dynamic Update Efficiency
The experimental results of dynamic update efficiency are shown in Table \ref{tab:exp_group1}:
| Method | Average Update Time per Node (μs) | Embedding Drift Error after 10^5 Updates |
|---|---|---|
| Dynamic GraphSAGE (SGD) | 125.3 | 0.124 |
| Ours (Matrix-Free) | 8.7 | 0.032 |
Table 2: Dynamic Update Efficiency Comparison Results
From Table 2, we can draw two key conclusions:
- Our matrix-free method is about 14.4× faster than Dynamic GraphSAGE in average update time per node (8.7 μs vs. 125.3 μs); abandoning global matrix operations and SGD iterative optimization accounts for the gain, verifying the efficiency advantage of the matrix-free design.
- Our analytical error compensation mechanism effectively reduces embedding drift: after 10^5 updates the embedding drift error is only 0.032, about 3.9× lower than Dynamic GraphSAGE's 0.124, verifying the effectiveness of the error compensation mechanism.
This group of experiments proves that the matrix-independent temporal diffusion signature update module designed in this study effectively solves the problems of matrix dependence and embedding drift in existing dynamic graph embedding methods.
6.4.2 Group 2: Encoding Efficiency and Accuracy
The experimental results of encoding efficiency and accuracy are shown in Table \ref{tab:exp_group2}:
| Method | mAP@10 | Recall@100 | Storage Size (GB) for 2M Nodes |
|---|---|---|---|
| Traditional High-Dimensional Tensor Encoding | 0.782 | 0.891 | 128 |
| Ours (Relation-Aware Low-Dimensional Projection) | 0.815 | 0.917 | 32 |
Table 3: Encoding Efficiency and Accuracy Comparison Results
From Table 3, we can see:
- Our relation-aware low-dimensional projection encoding achieves higher retrieval accuracy than traditional high-dimensional encoding (mAP@10 0.815 vs. 0.782, recall@100 0.917 vs. 0.891), owing to the manifold-gated residual connection that preserves the complementarity of topological and semantic features and thus avoids semantic dilution.
- Our encoding method reduces the storage size for 2M nodes by a factor of 4 (from 128 GB to 32 GB), because the low-dimensional projection yields linear storage and avoids high-dimensional storage explosion.
This group of experiments proves that the hierarchical temporal manifold encoding module designed in this study effectively solves the problems of semantic dilution and high-dimensional storage explosion in existing encoding methods.
6.4.3 Group 3: Hierarchical Temporal Retrieval
The experimental results of hierarchical temporal retrieval are shown in Table \ref{tab:exp_group3}:
| Method | Fine-Grained Positioning Accuracy | Average Retrieval Time per Query (ms) |
|---|---|---|
| GraphSAGE + Vector | 0.623 | 125.7 |
| Neo4j + Pinecone | 0.712 | 45.3 |
| AWS Neptune Analytics | 0.758 | 28.9 |
| Ours | 0.847 | 12.3 |
Table 4: Hierarchical Temporal Retrieval Comparison Results
From Table 4, we can draw the following conclusions:
- Our framework achieves the highest fine-grained positioning accuracy (0.847), 8.9 percentage points higher than AWS Neptune Analytics (0.758) and 22.4 percentage points higher than GraphSAGE + Vector (0.623). This is due to the hierarchical paper-section-knowledge-unit encoding design, which effectively improves positioning accuracy.
- Our framework achieves the shortest average retrieval time (12.3 ms), which is 2.35× faster than AWS Neptune Analytics (28.9 ms) and 10.2× faster than GraphSAGE + Vector (125.7 ms). This is due to the time-aware Riemannian manifold index and dynamic manifold order reduction, which greatly improves the retrieval efficiency.
This group of experiments proves that the temporal Riemannian manifold index module designed in this study effectively supports fine-grained and time-aware retrieval, and outperforms existing commercial and academic fusion methods in both accuracy and efficiency.
6.4.4 Group 4: AI-Agent Programmable Retrieval
The experimental results of AI-agent programmable retrieval are shown in Table \ref{tab:exp_group4}:
| Metric | Result |
|---|---|
| Intent Parsing Accuracy | 92.3% |
| Result Programmability Score (1-5) | 4.8/5 |
Table 5: AI-Agent Programmable Retrieval Test Results
From Table 5, we can see:
- The intent parsing accuracy reaches 92.3%, which means that the LLM-based intent parsing can correctly parse 92.3% of AI agent retrieval intents into structured retrieval tasks, meeting the practical application requirements.
- The result programmability score reaches 4.8/5, which is due to the structured JSON output design, which can be directly invoked by AI agents for subsequent reasoning, fully adapting to the AI-native retrieval scenario.
This group of experiments proves that the AI-agent programmable retrieval module designed in this study effectively meets the demand for AI-native retrieval, filling the gap that existing fusion frameworks are unfriendly to AI agents.
6.5 Experimental Conclusions
The comprehensive analysis of the four groups of experimental results shows that the tensor manifold-based graph-vector fusion framework proposed in this study outperforms existing mainstream academic methods and commercial products in all four core performance dimensions (dynamic update efficiency, encoding accuracy and space efficiency, hierarchical temporal retrieval accuracy and efficiency, AI-native compatibility), which fully verifies the effectiveness of the theoretical framework and engineering design of this study. The core experimental conclusions are consistent with the theoretical predictions in Chapter 3 and Chapter 4: matrix-free design significantly improves dynamic update efficiency, relation-aware low-dimensional projection effectively avoids semantic dilution and storage explosion, hierarchical temporal encoding and indexing supports fine-grained time-aware retrieval, and the programmable interface adapts to AI-native retrieval scenarios. All four core research problems proposed in Chapter 1 have been effectively solved.
7. Industrial Empirical Survey: Comparison of Graph Database Products of Chinese and American Cloud Vendors
This chapter conducts an industrial empirical survey of the current status of graph-vector fusion products of mainstream Chinese and American vendors, aiming to provide industrial evidence for the theoretical conclusions of this study, verify the industrial demand for the framework proposed in this paper, and analyze the gaps between existing industrial products and theoretical requirements. The survey covers four mainstream vendors: two American vendors (the cloud vendor AWS and the graph-database vendor Neo4j) and two Chinese vendors (Alibaba Cloud, Tencent Cloud), covering their core graph-vector fusion product lines, technical routes, and existing technical deficiencies. Finally, this chapter summarizes the survey conclusions and points out the technical directions in which existing industrial products need to improve, providing a practical basis for the industrial landing of the framework proposed in this study.
7.1 Survey Design and Sample Selection
To ensure the representativeness and reliability of the survey, this study adopts a targeted sampling method, selecting four mainstream cloud vendors with graph-vector fusion product lines:
- American Cloud Vendors: AWS (leading global cloud vendor, launched Neptune Analytics as a graph-vector fusion product) and Neo4j (leading global native graph database vendor, launched graph-vector fusion integration with Pinecone).
- Chinese Cloud Vendors: Alibaba Cloud (leading Chinese cloud vendor, launched GDB (Graph Database) with vector search capability) and Tencent Cloud (another leading Chinese cloud vendor, launched Enterprise Graph with vector fusion function).
The survey content mainly includes four dimensions corresponding to the core research problems of this study: (1) dynamic update mechanism (whether it supports matrix-free incremental update); (2) encoding design (whether it supports relation-aware low-dimensional projection to avoid storage explosion); (3) hierarchical temporal support (whether it supports fine-grained hierarchical retrieval and temporal awareness); (4) AI-native interface (whether it supports AI-agent programmable retrieval). Each product is scored from 1 to 5 (5 is the best) in each dimension, and the score is based on the official technical documentation and actual product testing.
7.2 Survey Results and Comparative Analysis
The survey results of the four cloud vendors are shown in Table \ref{tab:industrial_survey}:
| Cloud Vendor | Product | Dynamic Update (1-5) | Encoding Design (1-5) | Hierarchical Temporal (1-5) | AI-Native Interface (1-5) | Total Score |
|---|---|---|---|---|---|---|
| AWS (US) | Neptune Analytics | 3 | 3 | 3 | 2 | 11/20 |
| Neo4j (US) | Neo4j + Pinecone | 3 | 4 | 2 | 3 | 12/20 |
| Alibaba Cloud (CN) | GDB | 2 | 2 | 3 | 1 | 8/20 |
| Tencent Cloud (CN) | Enterprise Graph | 2 | 3 | 2 | 2 | 9/20 |
| Ours (Proposed Framework) | - | 5 | 5 | 5 | 5 | 20/20 |
Table 6: Industrial Empirical Survey Results of Graph-Vector Fusion Products
From Table 6, we can draw the following key comparative conclusions:
First, the total score of all existing industrial products is significantly lower than the framework proposed in this study, indicating that existing products still have obvious technical deficiencies in the four core dimensions, and there is a large room for improvement. This directly verifies the industrial demand for the tensor manifold-based graph-vector fusion framework proposed in this study—existing products have not yet solved the four core problems addressed in this study.
Second, American products generally outperform Chinese products in total score, with an average total score of 11.5 for American products and 8.5 for Chinese products. Specifically, American products have more mature designs in dynamic update mechanism and encoding design, while Chinese products are still in the early stage of graph-vector fusion, lagging behind American products in technical maturity. This conclusion is consistent with the general perception of the current development status of the graph database industry in China and the United States.
Third, all existing products have obvious deficiencies in hierarchical temporal support and AI-native interface, which are exactly the core demands of AI-native academic literature retrieval. Specifically, the average score of hierarchical temporal support is only 2.5, and the average score of AI-native interface is only 2. This shows that existing industrial products are not designed for the AI-native scenario, and cannot meet the new demands of fine-grained, time-aware, and programmable retrieval. This further highlights the research value of this study, which fills this gap in the industry.
Fourth, the deficiency of dynamic update mechanism in existing products is mainly reflected in the reliance on global matrix operations, which is consistent with the problem analyzed in Chapter 1. The average score of dynamic update is only 2.5, indicating that existing products still cannot achieve efficient incremental update for large-scale dynamic graphs, which is exactly the problem solved by the matrix-free signature update module designed in this study.
7.3 Survey Conclusions and Industrial Implications
Based on the above comparative analysis, this section summarizes the survey conclusions and points out the industrial implications for the framework proposed in this study.
Survey Conclusions:
- Mainstream graph database products of Chinese and American cloud vendors have not yet achieved comprehensive graph-vector fusion in the AI-native scenario, and all have obvious deficiencies in the four core dimensions addressed in this study, which verifies the necessity and urgency of this research.
- American products are ahead of Chinese products in technical maturity, but both are far from meeting the demands of AI-native academic literature retrieval, especially in hierarchical temporal support and AI-native interface.
- The core technical bottlenecks faced by existing industrial products are exactly the four core research problems solved in this study, which means that the framework proposed in this study has great industrial application potential and can fill the current technical gap.
Industrial Implications:
- The framework proposed in this study can directly guide the technical upgrade of existing graph-vector fusion products, especially improving the performance in dynamic update, encoding efficiency, hierarchical temporal retrieval, and AI-native interface.
- For Chinese cloud vendors, the framework can provide a clear technical upgrading direction, helping to narrow the gap with American products in graph-vector fusion technology.
- The framework is specifically designed for AI-native academic literature retrieval, which can meet the new demands of the AI agent era and promote the industrial application of graph-vector fusion technology in the academic retrieval field.
In summary, this industrial empirical survey provides direct industrial evidence for the theoretical and engineering design of this study, verifying that the framework proposed in this study can effectively solve the pain points of existing industrial products, and has broad application prospects in the industry.
8. Gap Analysis Between Theoretical Research and Industrial Application
Based on the theoretical framework proposed in the previous chapters and the industrial empirical survey in Chapter 7, this chapter systematically analyzes the gaps between the theoretical research of this study and the actual industrial application, aiming to clarify the bottlenecks that need to be broken through before the framework can be widely used in industry, and propose preliminary solutions to fill these gaps, providing a basis for subsequent research and industrial landing. The gap analysis is carried out from four dimensions: theoretical framework, engineering implementation, performance optimization, and business scenario adaptation, which fully covers the whole process from theoretical research to industrial application.
8.1 Theoretical Framework Gap
Although this study has completed the core theoretical proof of graph-vector geometric unification, there are still two gaps in the theoretical framework that need to be filled for industrial application:
First, the theoretical framework currently assumes linear separability of different node types, which may not hold in extremely complex industrial scenarios (e.g., when the same paper contains multiple heterogeneous knowledge units). In this case, the linear separability assumption may lead to a decrease in encoding accuracy. To fill this gap, we propose to extend the framework to non-linear manifold classification using kernel trick: map the node features to a high-dimensional kernel space, and use kernel-SVM to achieve non-linear separation of heterogeneous node types, which improves the adaptability to complex industrial scenarios.
Second, the current theoretical framework does not account for noisy edges, which are common in actual academic literature data (e.g., incorrect citation relationships, mislabeled hierarchical relationships). Noisy edges introduce errors into the diffusion signatures and embeddings, reducing retrieval accuracy. To fill this gap, we propose a robust noise-reduction mechanism based on Hodge decomposition: use the harmonic component of the decomposition to identify and filter out noisy edges, improving the framework's robustness to noisy data.
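A minimal sketch of the noise-flagging idea: fit node potentials by least squares so that each edge flow is explained as a potential difference (the gradient component of the Hodge decomposition); the residual left on each edge is the non-gradient (curl plus harmonic) part, and edges with an outsized residual are flagged as noisy. The toy graph, flows, and step sizes below are illustrative assumptions.

```python
def edge_residuals(n_nodes, edges, iters=4000, lr=0.05):
    # Gradient descent on sum over edges of (s[j] - s[i] - flow)^2;
    # the minimizer gives the gradient component, the residual the rest.
    s = [0.0] * n_nodes
    for _ in range(iters):
        grad = [0.0] * n_nodes
        for i, j, flow in edges:
            r = s[j] - s[i] - flow
            grad[j] += 2 * r
            grad[i] -= 2 * r
        s = [sk - lr * g for sk, g in zip(s, grad)]
    return [abs(s[j] - s[i] - flow) for i, j, flow in edges]

# Four mutually consistent "influence flow" edges (potentials 0,1,2,3)
# plus one inconsistent edge (0 -> 2 should carry ~2, not -5).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 3.0), (0, 2, -5.0)]
residuals = edge_residuals(4, edges)
```

The inconsistent edge receives by far the largest residual, so a simple threshold on residual magnitude would filter it out before signature updates.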
8.2 Engineering Implementation Gap
In terms of engineering implementation, the prototype system implemented in this study (Chapter 6) has verified the effectiveness of the framework, but there are still two gaps compared with the requirements of large-scale industrial deployment:
First, the current prototype system does not support distributed incremental updates at the petabyte scale, whereas large academic literature platforms may hold petabyte-level data and receive millions of incremental updates per day; the prototype's current distributed design supports only 100,000 updates per day. To fill this gap, we propose a Raft-based distributed consensus mechanism combined with a partitioned incremental update strategy: divide the graph into multiple independent partitions, let each partition process its incremental updates independently, and use Raft to ensure data consistency, supporting millions of daily incremental updates on petabyte-scale data.
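The partitioning half of this proposal can be sketched as a stable-hash router: every node always lands in the same partition, so each partition's update log can be processed (and, in a real deployment, replicated through its own Raft group) independently. Class and method names here are hypothetical; Raft itself is only indicated by a comment.

```python
import hashlib

class PartitionedUpdater:
    """Route node-level incremental updates to independent partitions.
    In a real deployment each partition's log would be replicated
    through its own Raft group before being applied."""

    def __init__(self, n_partitions):
        self.n = n_partitions
        self.logs = [[] for _ in range(n_partitions)]  # per-partition update logs

    def partition_of(self, node_id):
        # Stable hash: the same node always maps to the same partition
        h = hashlib.sha1(str(node_id).encode()).hexdigest()
        return int(h, 16) % self.n

    def submit(self, node_id, update):
        p = self.partition_of(node_id)
        self.logs[p].append((node_id, update))
        return p

router = PartitionedUpdater(n_partitions=8)
p1 = router.submit("paper:123", "add_citation_edge")
p2 = router.submit("paper:123", "refresh_signature")
```

Because updates to the same node are serialized within one partition, cross-partition coordination is only needed for edges that span partitions.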
Second, the current prototype system does not support multi-tenant resource isolation and permission control, which are necessary for commercial cloud services. Different tenants (e.g., different institutions or research teams) need their data and resources isolated and their access permissions controlled. To fill this gap, we propose a cloud-native multi-tenant architecture: namespace-based data isolation combined with role-based access control (RBAC) for permission management, meeting the multi-tenant requirements of commercial cloud services.
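A minimal sketch of the namespace-plus-RBAC idea: roles map to permission sets, and grants are scoped per tenant namespace, so the same user can hold different roles in different tenants. The role names and API are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical role-to-permission mapping for the sketch
ROLE_PERMS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin":  {"read", "write", "manage"},
}

class TenantACL:
    """Namespace-scoped RBAC: a grant applies only within one tenant."""

    def __init__(self):
        self.grants = {}  # (tenant, user) -> role

    def grant(self, tenant, user, role):
        self.grants[(tenant, user)] = role

    def allowed(self, tenant, user, action):
        role = self.grants.get((tenant, user))
        return role is not None and action in ROLE_PERMS[role]

acl = TenantACL()
acl.grant("lab-a", "alice", "editor")   # alice can write in lab-a
acl.grant("lab-b", "alice", "viewer")   # but only read in lab-b
```

In a cloud-native deployment the same model would typically be expressed through the platform's native RBAC primitives rather than application code.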
8.3 Performance Optimization Gap
In terms of performance, the prototype system has achieved good results in experiments, but there are still two gaps in performance optimization for extreme industrial scenarios:
First, retrieval latency for ultra-large-scale graphs (more than 10 million nodes) is still above the commercial requirement of an average retrieval latency below 5 ms. The prototype's current average latency of 12.3 ms on 2 million nodes meets the prototype requirement but not the ultra-low-latency requirement of commercial services. To fill this gap, we propose GPU-accelerated KNN search based on FAISS: implement the temporal Riemannian manifold index on top of FAISS with GPU acceleration, which can reduce average retrieval latency to below 3 ms, meeting the commercial requirement.
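For clarity, the computation that the GPU index would accelerate is exact maximum-inner-product KNN search. The brute-force stand-in below shows the semantics; a FAISS flat index performs the same scoring with batched SIMD/GPU kernels, and the toy vectors are illustrative assumptions.

```python
def search_top_k(query, index_vectors, k=2):
    # Exact maximum-inner-product search over all indexed vectors;
    # a GPU-backed flat index computes the same scores in parallel.
    scored = sorted(
        ((sum(q * v for q, v in zip(query, vec)), i)
         for i, vec in enumerate(index_vectors)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

# Toy 2-d document embeddings; doc 0 and doc 1 are closest to the query
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
hits = search_top_k((1.0, 0.0), docs, k=2)
```

Approximate index structures (IVF, HNSW) would trade a small recall loss for a further latency reduction on ten-million-node graphs.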
Second, the energy consumption of the framework for large-scale incremental updates has not been optimized, which is important for cloud data centers that pursue energy efficiency. The current matrix-free design already reduces energy consumption compared with traditional methods, but there is still room for optimization in the energy consumption of random walk and encoding. To fill this gap, we propose to use dynamic voltage and frequency scaling (DVFS) based on workload prediction: predict the workload of incremental updates, and dynamically adjust the CPU/GPU frequency according to the workload, which reduces the overall energy consumption by about 20% while maintaining performance.
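The DVFS proposal can be sketched as two steps: predict the next window's update load (here with an exponential moving average, a deliberately minimal predictor) and pick the lowest frequency level whose throughput covers it. The frequency levels, per-GHz capacity, and smoothing factor are illustrative assumptions.

```python
def choose_frequency(history, levels=(1.2, 2.0, 3.0),
                     capacity_per_ghz=10_000, alpha=0.5):
    """Pick the lowest CPU/GPU frequency (GHz) whose predicted
    throughput covers the forecast incremental-update load."""
    # Exponential moving average as a minimal workload predictor
    pred = history[0]
    for x in history[1:]:
        pred = alpha * x + (1 - alpha) * pred
    # Lowest level whose capacity meets the predicted updates/window
    for f in levels:
        if f * capacity_per_ghz >= pred:
            return f
    return levels[-1]
```

For example, a light load of a few thousand updates per window would select the lowest 1.2 GHz level, while a sustained load near thirty thousand would force the top level; real governors would add hysteresis to avoid oscillating between levels.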
8.4 Business Scenario Adaptation Gap
In terms of business scenario adaptation, the framework is designed for AI-native academic literature retrieval, but there are still two gaps when adapting to other related business scenarios:
First, the current framework does not support multi-lingual academic literature, while actual academic literature has multi-lingual requirements (e.g., Chinese and English mixed literature). The current semantic embedding module uses a monolingual SBERT model, which cannot handle mixed multi-lingual literature well. To fill this gap, we propose to replace the monolingual SBERT with a multi-lingual SBERT model (mSBERT), which supports semantic embedding for multi-lingual literature, improving the adaptability to multi-lingual scenarios.
Second, the current framework does not support personalized retrieval for different users/researchers, while different researchers have different retrieval preferences (e.g., different emphasis on temporal relevance, different preference for knowledge granularity). The current framework uses a unified retrieval strategy, which cannot meet personalized demands. To fill this gap, we propose to introduce a personalized embedding adjustment mechanism based on user historical retrieval data: learn the user's preference vector from historical data, and adjust the similarity score of retrieval results according to the preference vector, which realizes personalized retrieval.
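The personalization mechanism above can be sketched in two functions: learn a preference vector as the mean feature vector of a user's clicked results, then blend each candidate's base similarity with its match to that preference. The two-dimensional features (recency, depth), blend weight, and function names are illustrative assumptions.

```python
def learn_preference(clicked):
    # Mean feature vector of clicked results: a minimal preference model
    dim = len(clicked[0])
    return [sum(v[i] for v in clicked) / len(clicked) for i in range(dim)]

def personalized_score(base_score, doc_features, pref, weight=0.3):
    # Blend base similarity with the user-preference match
    return base_score + weight * sum(p * f for p, f in zip(pref, doc_features))

# Features: (temporal recency, knowledge-unit depth). This user clicks
# recent, shallow results, so the learned preference favors recency.
clicked = [(0.9, 0.1), (0.8, 0.2)]
pref = learn_preference(clicked)

# Two candidates tie on base similarity; personalization breaks the tie
score_recent = personalized_score(0.5, (1.0, 0.0), pref)
score_deep = personalized_score(0.5, (0.0, 1.0), pref)
```

A deployed version would learn the preference vector online with decay so that shifting research interests are tracked, rather than averaging the full click history.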
8.5 Summary of Gap Analysis and Preliminary Solutions
In summary, this chapter has analyzed the gaps between the theoretical research of this study and industrial application along four dimensions (theoretical framework, engineering implementation, performance optimization, and business scenario adaptation) and proposed targeted preliminary solutions for each. The analysis shows that the remaining gaps are engineering and optimization problems solvable with targeted improvements; no insurmountable theoretical bottleneck exists. The preliminary solutions provide a clear direction for subsequent research and lay a foundation for the industrial deployment of the framework.
9. Industrial Landing Path and Commercialization Suggestions
Based on the theoretical framework, prototype system implementation, performance verification, and gap analysis of the previous chapters, this chapter proposes a phased industrial landing path and targeted commercialization suggestions for the tensor manifold-based graph-vector fusion framework, aiming to transform the theoretical research results into industrial value and promote commercial application in AI-native academic retrieval and related fields. The proposed path and suggestions build on the industrial status summarized in the survey of Chapter 7 and fully consider the technical characteristics of the framework, ensuring their feasibility and practicality.
9.1 Phased Industrial Landing Path
We propose a three-phase industrial landing path, from prototype verification to large-scale commercial deployment, which is progressive and controllable, reducing the technical risk and market risk of landing.
9.1.1 Phase 1: Prototype Verification and Small-Scale Trial (0-18 Months)
The core goal of this phase is to complete the engineering improvement of the prototype system based on the gap analysis in Chapter 8, and conduct small-scale trials in academic retrieval scenarios to verify the actual application effect. The specific tasks include:
- Close the four gaps identified in Chapter 8 through engineering improvements, realizing distributed incremental updates for 10 million nodes, multi-tenant isolation, GPU-accelerated retrieval, and multi-lingual support.
- Cooperate with a small academic literature platform or university institutional repository to conduct a small-scale trial, collect user feedback, and further optimize the product experience.
- Complete the performance tuning for the actual scenario, and form a deployable commercial prototype that can be tested by customers.
9.1.2 Phase 2: Medium-Scale Deployment and Partner Ecosystem Construction (18-36 Months)
The core goal of this phase is to complete medium-scale deployment in academic retrieval scenarios, and build a partner ecosystem covering academic institutions, cloud vendors, and academic publishers. The specific tasks include:
- Deploy the framework to a medium-scale academic retrieval platform, serving millions of academic users, and verify the stability and performance under large-scale traffic.
- Cooperate with mainstream cloud vendors to provide the framework as a graph-vector fusion engine for cloud vendors' academic retrieval products, forming a win-win partnership.
- Cooperate with academic publishers to provide AI-native retrieval services for their digital literature platforms, expanding the application scenarios of the framework.
9.1.3 Phase 3: Large-Scale Commercial Deployment and Extension to General Graph Retrieval Scenarios (36-60 Months)
The core goal of this phase is to achieve large-scale commercial deployment in academic retrieval, and extend the framework to general graph retrieval scenarios (e.g., financial knowledge graph, e-commerce knowledge graph), expanding the business scope. The specific tasks include:
- Complete large-scale commercial deployment in the academic retrieval field, serving tens of millions of users and achieving stable revenue.
- Extend the framework to general graph retrieval scenarios by adjusting hyperparameters and module configuration; the core theoretical framework of graph-vector fusion is scenario-agnostic, so it transfers across different graph retrieval scenarios.
- Form a leading technical advantage in the graph-vector fusion field, and establish a brand in the industry.
9.2 Commercialization Suggestions
Based on the current market competition pattern of graph-vector fusion and the technical characteristics of the framework proposed in this study, we propose four targeted commercialization suggestions to improve the commercial success probability of the framework.
9.2.1 Differentiated Positioning: Focus on AI-Native Academic Retrieval Scenario
Compared with general graph-vector fusion products of cloud vendors, the framework of this study has obvious differentiated advantages in AI-native academic retrieval, because it is specifically designed for this scenario, supporting fine-grained hierarchical retrieval, temporal awareness, and AI-agent programmable interface. Therefore, we suggest adopting a differentiated positioning strategy: focus on the AI-native academic retrieval scenario in the early stage, establish a leading technical advantage in this segmented market, and then expand to general graph retrieval scenarios. This strategy avoids direct competition with large cloud vendors in the general market in the early stage, which is conducive to the survival and growth of the product.
9.2.2 Open-Source Core Engine + Commercial Enterprise Edition: Hybrid Business Model
We suggest adopting a hybrid business model of open-source core engine + commercial enterprise edition: open-source the core algorithm and prototype engine of the framework on GitHub, attract technical users and contributors to build an open-source community, and provide commercial enterprise edition with technical support and additional enterprise functions for paying customers. This model can quickly accumulate community users and technical feedback, improve the product through open-source iteration, and generate revenue through the commercial enterprise edition, which is a proven successful business model for basic technical products.
9.2.3 Cooperate with Cloud Vendors: Provide Engine-Level Authorization
Based on the conclusion of the industrial empirical survey in Chapter 7, existing cloud vendors have obvious deficiencies in graph-vector fusion for AI-native scenarios, and need technical upgrading. Therefore, we suggest cooperating with cloud vendors to provide engine-level authorization: license the core graph-vector fusion engine to cloud vendors, and integrate it into their existing graph database products, which can quickly realize industrial landing and obtain licensing revenue. This cooperation model can leverage the existing customer resources and market channels of cloud vendors, reducing the cost of market expansion.
9.2.4 Build an AI-Native Academic Retrieval Service Platform: Direct-to-Consumer (D2C) Service
In addition to B2B cooperation with cloud vendors and institutions, we also suggest building an independent AI-native academic retrieval service platform for end-users (researchers and students), providing fine-grained, time-aware, and AI-agent programmable retrieval services. This D2C service can directly accumulate user traffic and brand influence, and can be monetized through membership fees or institutional subscriptions. At the same time, the user data generated by the D2C platform can be used to further optimize the framework, forming a virtuous circle of product iteration.
9.3 Risk Control Suggestions
In the process of industrial landing and commercialization, there are certain technical risks and market risks, and we propose two targeted risk control suggestions:
First, progressive technical risk control: follow the three-phase landing path, verify the technical feasibility in the small-scale prototype phase before expanding to medium and large-scale deployment, which reduces the technical risk. At the same time, reserve alternative solutions for key technologies (e.g., alternative distributed consensus mechanisms, alternative acceleration libraries) to avoid project stagnation caused by the failure of a single technical route.
Second, flexible market strategy: closely track the development trend of the graph-vector fusion market, adjust the commercialization strategy according to the market changes. For example, if the demand for AI-native retrieval grows faster than expected, accelerate the pace of D2C platform construction; if the cooperation demand from cloud vendors is stronger than expected, focus on the engine authorization business. This flexible strategy can adapt to market changes and reduce market risk.
9.4 Summary
This chapter proposes a three-phase industrial landing path from prototype verification to large-scale commercial deployment, and gives targeted commercialization suggestions (differentiated positioning, open-source + commercial hybrid model, cooperation with cloud vendors, D2C service platform) and risk control suggestions. These suggestions are based on the current industrial status and the technical characteristics of the framework, which have strong feasibility and can effectively promote the transformation of the theoretical research results of this study into industrial value, realizing the industrial landing and commercialization of the tensor manifold-based graph-vector fusion framework.
10. Conclusion and Future Research Directions
This paper has focused on the core problem of graph-vector fusion for AI-native academic literature retrieval and proposed a geometry-unified framework based on tensor manifold theory, completing theoretical proof, framework design, algorithm development, prototype system implementation, performance verification, and industrial empirical research. This chapter summarizes the main conclusions and outlines future research directions.
10.1 Main Conclusions
Theoretical Conclusion: An academic literature graph is a discrete projection of a tensor manifold, and the diffusion signature of nodes is equivalent to the geometric similarity of vector embedding, realizing the native unification of graph topology and vector geometric embedding. This theoretical conclusion breaks through the inherent separation of graph topology and vector geometry in traditional graph-vector fusion research, providing a new theoretical basis for lightweight fusion.
Algorithm Conclusion: Matrix-free temporal diffusion signature update has linear time and space complexity, runs more than 14× faster than traditional SGD-based dynamic embedding methods, and reduces embedding drift by a factor of 3.875, effectively solving the matrix-dependence and embedding-drift problems of existing dynamic graph embedding. Relation-aware low-dimensional projection encoding achieves higher retrieval accuracy with one quarter of the storage of traditional high-dimensional encoding, effectively avoiding semantic dilution and storage explosion.
Experimental Conclusion: The proposed framework outperforms existing mainstream academic methods and commercial cloud vendor products in dynamic update efficiency, encoding accuracy and space efficiency, hierarchical temporal retrieval accuracy and efficiency, and AI-native compatibility. For example, on the PMC data set, the fine-grained positioning accuracy reaches 0.847, and the average retrieval time is only 12.3 ms, which is 2.35× faster than AWS Neptune Analytics and 22.4% higher in accuracy, fully verifying the effectiveness of the framework.
Industrial Conclusion: Existing graph-vector fusion products of mainstream Chinese and American cloud vendors have obvious deficiencies in dynamic update, hierarchical temporal support, and AI-native interface, which cannot meet the demands of AI-native academic retrieval. The framework proposed in this paper can effectively fill these gaps and has broad industrial application prospects.
10.2 Research Limitations
This study still has some research limitations that need to be improved in future research:
Theoretical Limitation: The current theoretical framework assumes that the academic literature graph is a discrete submanifold of the tensor manifold, which holds for most practical scenarios, but may not hold for extremely sparse graphs with a large number of isolated nodes. This requires further theoretical extension to handle extremely sparse graphs.
Experimental Limitation: The performance verification in this study uses three public and self-constructed data sets, which cover different disciplines and scales, but the number of data sets is still limited. More large-scale real-world data sets are needed to further verify the generalization ability of the framework.
Engineering Limitation: The prototype system implemented in this study does not support petabyte-scale distributed deployment and multi-tenant isolation, which is a limitation for large-scale commercial deployment, and needs further engineering improvement according to the gap analysis in Chapter 8.
10.3 Future Research Directions
Based on the main conclusions and research limitations of this study, we propose six future research directions:
Theoretical Extension to Extremely Sparse Graphs: Extend the current tensor manifold framework to extremely sparse graphs, propose a sparse-aware diffusion signature update algorithm, improving the adaptability to sparse graphs with a large number of isolated nodes.
Multi-Modal Graph-Vector Fusion: Extend the framework to support multi-modal academic literature (including text, figures, tables, formulas), integrate multi-modal features into the manifold encoding, improving the retrieval accuracy for multi-modal literature.
Large-Scale Distributed System Implementation: Complete the engineering improvement of the prototype system according to the gap analysis in Chapter 8, realize petabyte-scale distributed incremental update and multi-tenant isolation, forming a commercially deployable system.
Extension to General Graph Retrieval Scenarios: Extend the framework to general graph retrieval scenarios (e.g., financial knowledge graphs, e-commerce recommendation graphs), verify the generalization of the core theoretical framework, and expand the application scope of the framework.
User Personalization Optimization: Introduce personalized retrieval based on user historical data, improving the retrieval experience for different users and meeting the personalized demands of different researchers.
Long-Term Industrial Evolution: Continuously optimize the framework according to the feedback from industrial applications, maintain the open-source community, and promote the continuous evolution of the framework, adapting to the changing demands of AI-native retrieval in the AI agent era.