AI Empowered Blockchain Data Indexing: The Evolution from The Graph to Chainbase

The Evolution of Blockchain Data Indexing: From Nodes to AI-Powered Full Chain Data Services

1 Introduction

From the first batch of dApps that emerged in 2017 to today's flourishing landscape of blockchain applications, have we ever stopped to consider where the data these dApps rely on actually comes from?

In 2024, AI and Web3 have become hot topics. In the field of AI, data is like the source of life. Just as plants need sunlight and water, AI systems rely on massive amounts of data to continuously learn and evolve. Without data, even the most sophisticated AI algorithms cannot demonstrate their intended intelligence.

This article analyzes the evolution of data indexing over the course of the industry's development from the perspective of blockchain data accessibility, and compares the established indexing protocol The Graph with the emerging Chainbase and Space and Time, examining how these two newer protocols, which incorporate AI technology, differ in their data services and product architecture.


2 The Complexity and Simplicity of Data Indexing: From Blockchain Nodes to Full Chain Database

2.1 Data Source: Blockchain Node

Blockchain is regarded as a decentralized ledger. Nodes are the foundation of the blockchain network, responsible for recording, storing, and disseminating all on-chain transaction data. Each node holds a complete copy of the blockchain data, preserving the network's decentralized character. However, for ordinary users, building and maintaining a node is no easy task: it requires not only specialized skills but also high hardware and bandwidth costs. Moreover, ordinary nodes have limited query capabilities and cannot return data in the formats developers need. So while in theory anyone can run a node, in practice users rely heavily on third-party services.

To solve this problem, RPC node providers have emerged. They bear the cost and management of running nodes and expose the data through RPC endpoints, so users can access blockchain data without building their own nodes. Public RPC endpoints are free but rate-limited, which can degrade the dApp experience. Private RPC endpoints perform better, but even simple data retrieval requires a great deal of back-and-forth communication, leading to inefficiency and scalability problems. Nevertheless, the standardized API interfaces of node providers lower the threshold for data access and lay the foundation for subsequent data parsing and application.
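To make this concrete, here is a minimal Python sketch of what talking to an RPC endpoint looks like: a JSON-RPC 2.0 request is built and the hex-encoded result is decoded. The response shown is a canned example rather than a live call, and no particular provider is assumed.

```python
import json

def build_rpc_request(method: str, params: list, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body, as sent to any Ethereum RPC endpoint."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    })

def parse_block_number(response_body: str) -> int:
    """Extract the block height from an eth_blockNumber response (hex-encoded)."""
    result = json.loads(response_body)["result"]
    return int(result, 16)

# Request the latest block height:
payload = build_rpc_request("eth_blockNumber", [])

# A node would answer with an envelope like this (illustrative value):
sample_response = '{"jsonrpc": "2.0", "id": 1, "result": "0x134e82a"}'
print(parse_block_number(sample_response))  # 20244522
```

Note how low-level the interface is: one request yields one raw value, which is exactly why complex dApp queries over plain RPC require so many round trips.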


2.2 Data Parsing: From Raw Data to Usable Data

The raw data provided by blockchain nodes is usually serialized and encoded rather than human-readable, which ensures integrity and security but also makes it difficult to interpret. For ordinary users or developers, directly handling this data requires significant technical knowledge and computational resources.

The data parsing process thus becomes crucial. By converting complex raw data into an easily understandable and operable format, users can work with the data far more intuitively. The quality of this parsing directly affects the efficiency of blockchain data applications and is a key step in the entire indexing process.
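As an illustration of what "parsing" means in practice, the Python sketch below decodes a raw ERC-20 Transfer event log — topic-encoded addresses plus an ABI-encoded amount — into human-readable fields. The addresses and amount used are illustrative sample values, not real on-chain data.

```python
# keccak256("Transfer(address,address,uint256)") — the standard ERC-20 event signature
TRANSFER_SIG = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_transfer_log(topics: list, data: str) -> dict:
    """Decode a raw ERC-20 Transfer log entry.

    topics[0] is the hash of the event signature; the indexed from/to
    addresses sit in topics[1] and topics[2], left-padded to 32 bytes;
    the unindexed amount is ABI-encoded in the data field.
    """
    if topics[0] != TRANSFER_SIG:
        raise ValueError("not a Transfer event")
    return {
        "from": "0x" + topics[1][-40:],  # last 20 bytes of the padded word
        "to": "0x" + topics[2][-40:],
        "value": int(data, 16),          # uint256 amount in base units
    }

# Illustrative raw log, shaped like an eth_getLogs entry:
log = decode_transfer_log(
    topics=[
        TRANSFER_SIG,
        "0x000000000000000000000000a9d1e08c7793af67e9d92fe308d5697fb81d3e43",
        "0x00000000000000000000000047ac0fb4f2d84898e4d9e7b4dab3c24507a6d503",
    ],
    data="0x0000000000000000000000000000000000000000000000000de0b6b3a7640000",
)
print(log["value"])  # 1000000000000000000 (1 token with 18 decimals)
```

Indexers perform exactly this kind of decoding at scale, turning opaque hex blobs into queryable rows.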

2.3 The Evolution of Data Indexers

As the amount of blockchain data grows, so does the demand for indexers. Indexers organize on-chain data and load it into databases for querying. They index blockchain data and make it available at any time through query languages such as GraphQL. By providing a unified query interface, indexers let developers quickly and accurately retrieve information using standardized languages, greatly simplifying the process.

Different types of indexers optimize data retrieval in various ways:

  1. Full Node Indexer: Operates a full blockchain node to extract data directly, ensuring completeness and accuracy, but requires substantial storage and processing power.
  2. Lightweight Indexer: Relies on full nodes to fetch specific data on demand, reducing storage requirements but potentially increasing query time.
  3. Dedicated Indexer: Optimizes retrieval for a specific data type or blockchain, such as NFT data or DeFi transactions.
  4. Aggregated Indexer: Extracts data from multiple blockchains and sources, including off-chain information, and provides a unified query interface, suitable for multi-chain dApps.

Currently, an Ethereum archive node occupies about 13.5TB of storage under the Geth client and about 3TB under the Erigon client, and storage requirements keep growing as the blockchain does. In the face of such data volumes, mainstream indexing protocols support multi-chain indexing and offer customizable data parsing frameworks for different application needs, such as The Graph's "subgraph" framework.

Indexers significantly improve data indexing and query efficiency. Compared to traditional RPC endpoints, they can efficiently index large amounts of data and support high-speed queries. Users can run complex queries and easily filter and analyze data. Some indexers also aggregate data sources from multiple chains, avoiding the need to integrate multiple APIs when deploying multi-chain dApps. Distributed operation provides stronger security and performance and reduces the risk of interruptions that centralized RPC providers can suffer.

Through predefined query languages, indexers allow users to obtain the information they need directly, without having to handle the complex underlying data. This greatly improves the efficiency and reliability of data retrieval and is an important innovation in blockchain data access.
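By way of contrast with the raw RPC interface above, a typical indexer query is a single GraphQL request. The Python sketch below builds such a request and unpacks a response; the entity and field names (`swaps`, `amountUSD`) are assumptions modeled on common DEX subgraph schemas, and the response is a canned example, not a live call.

```python
import json

# A hypothetical subgraph query: the three most recent swaps with their USD value.
SUBGRAPH_QUERY = """
{
  swaps(first: 3, orderBy: timestamp, orderDirection: desc) {
    id
    amountUSD
  }
}
"""

def build_graphql_payload(query: str) -> str:
    """Wrap a GraphQL query for POSTing to an indexer's HTTP endpoint."""
    return json.dumps({"query": query})

def extract_swaps(response_body: str) -> list:
    """Pull the entity list out of a standard GraphQL response envelope."""
    return json.loads(response_body)["data"]["swaps"]

payload = build_graphql_payload(SUBGRAPH_QUERY)

# An indexer would respond with an envelope like this (illustrative values):
sample = '{"data": {"swaps": [{"id": "0xabc-1", "amountUSD": "1523.7"}]}}'
print(extract_swaps(sample)[0]["amountUSD"])  # 1523.7
```

One declarative request replaces what would otherwise be many RPC calls plus client-side decoding and filtering.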

![Read, index to analyze, a brief overview of the Web3 data indexing track](https://img-cdn.gateio.im/webp-social/moments-cf9a002b9b094fbbe3be7f611001b5c1.webp)

2.4 Full-Chain Database: Aligning with the Stream-First Paradigm

Using index nodes to query data usually means that an API becomes the sole gateway to on-chain data. However, when a project enters its scaling stage, it often needs more flexible data sources than a standardized API can provide. As application requirements grow more complex, first-generation indexers and their standardized indexing formats struggle to cope with diverse query needs, such as search, cross-chain access, or off-chain data mapping.

In modern data pipeline architecture, the "stream-first" approach has become a solution to the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift allows organizations to respond immediately to incoming data, deriving insights and making decisions almost in real-time. Similarly, blockchain data service providers are also moving towards building data streams, with traditional indexing service providers launching real-time blockchain data stream products, such as The Graph's Substreams, Goldsky's Mirror, and real-time data lakes based on blockchain-generated data from Chainbase and SubSquid.

These services aim to address the need for real-time parsing of blockchain transactions and to provide comprehensive query capabilities. Just as the "stream-first" architecture revolutionized traditional data processing by reducing latency and enhancing responsiveness, these blockchain data stream providers hope to support broader application development and on-chain data analysis through more advanced and mature data sources.

Revisiting the on-chain data challenges from the perspective of modern data pipelines allows us to see the potential of data management, storage, and provision from a new angle. When we view indexers like Subgraph and Ethereum ETL as data streams rather than final outputs, we can envision a world where high-performance datasets can be tailored for any business use case.
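The batch-versus-stream distinction described above can be sketched in a few lines of Python: instead of recomputing an aggregate over the whole history, a stream-first pipeline updates it incrementally as each block arrives. The block feed here is a simulated in-memory list standing in for a real chain subscription.

```python
from typing import Iterator

def block_stream(blocks: list) -> Iterator[dict]:
    """Yield blocks one at a time, as a stream-first pipeline consumes them.

    A real implementation would subscribe to new chain heads (e.g. over a
    websocket); here an in-memory list stands in for the chain.
    """
    for block in blocks:
        yield block

def running_transfer_volume(stream: Iterator[dict]) -> Iterator[int]:
    """Maintain a running aggregate as each block arrives, instead of
    recomputing over the entire history in a periodic batch job."""
    total = 0
    for block in stream:
        total += sum(tx["value"] for tx in block["txs"])
        yield total  # an up-to-date figure is available after every block

# Illustrative feed of two blocks:
feed = [
    {"number": 1, "txs": [{"value": 5}, {"value": 7}]},
    {"number": 2, "txs": [{"value": 3}]},
]
print(list(running_transfer_volume(block_stream(feed))))  # [12, 15]
```

The same shape underlies products like Substreams or Mirror: downstream consumers tap the stream and materialize whatever dataset their use case needs.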

![Read, index to analyze, brief on the Web3 data indexing track](https://img-cdn.gateio.im/webp-social/moments-b343cab5112c1a3d52f4e72122ae0df2.webp)

3 AI + Database? In-depth comparison of The Graph, Chainbase, Space and Time

3.1 The Graph

The Graph network provides multi-chain data indexing and querying services through a decentralized node network, making it easier for developers to index blockchain data and build applications. Its main product models are the data query execution market and the data indexing cache market, both serving users' querying needs. In the query execution market, consumers pay appropriate indexing nodes for the data they need; in the indexing cache market, indexing nodes allocate resources based on subgraphs' historical popularity, query fees, and curation signals.

Subgraphs are the fundamental data structure of The Graph network, defining how to extract and transform data from the Blockchain into a queryable format. Anyone can create a subgraph, and multiple applications can reuse it, enhancing data reusability and efficiency.

The Graph network consists of four roles: Indexers, Curators, Delegators, and Developers, which together support the data needs of web3 applications. The responsibilities of each role are as follows:

  • Indexers: Node operators who stake GRT to participate in the network and provide indexing and query processing services.
  • Delegators: Stake GRT to indexers to support their operations and earn a share of those indexers' rewards.
  • Curators: Signal which subgraphs should be prioritized for indexing by the network, ensuring that valuable subgraphs are processed.
  • Developers: The main users of The Graph, who create and submit subgraphs to the network to have their data requests served.

Currently, The Graph has shifted to a fully decentralized subgraph hosting service, with economic incentives among participants keeping the system running:

  • Indexers earn revenue from query fees and a portion of GRT block rewards.
  • Delegators receive a share of the rewards earned by the indexers they support.
  • Curators who signal on valuable subgraphs receive a portion of those subgraphs' query fees.

The Graph product is rapidly developing in the wave of AI. Semiotic Labs, as one of the core development teams, is committed to optimizing index pricing and user query experience using AI technology. The currently developed tools AutoAgora, Allocation Optimizer, and AgentC have improved the performance of the ecosystem in multiple aspects:

  • AutoAgora introduces a dynamic pricing mechanism that adjusts prices in real-time based on query volume and resource usage, optimizing pricing strategies to ensure the competitiveness of indexers and maximize revenue.
  • Allocation Optimizer addresses the resource allocation challenges of subgraphs, helping indexers achieve optimal configurations to enhance revenue and performance.
  • AgentC allows users to access Blockchain data through natural language, enhancing user experience.

Applied together with AI, these tools have made The Graph's system more intelligent and user-friendly.

![Reading, indexing to analysis, a brief overview of the Web3 data indexing track](https://img-cdn.gateio.im/webp-social/moments-97443cbd177ac4ffd1665da670ffbf12.webp)

3.2 Chainbase

Chainbase is a full-chain data network that integrates all Blockchain data into one platform, making it easier for developers to build and maintain applications. Its unique features include:

  • Real-time Data Lake: Provides a dedicated real-time data lake for blockchain data streams, allowing data to be accessed immediately upon generation.
  • Dual-chain architecture: Built on Eigenlayer AVS for the execution layer, forming a parallel dual-chain architecture with the CometBFT consensus algorithm. This design enhances cross-chain data programmability and composability, supports high throughput, low latency, and finality, and improves network security through dual staking.
  • Innovative Data Format Standard: Introduces "manuscripts", a new data format standard that optimizes how data in the cryptocurrency industry is structured and utilized.
  • Cryptocurrency World Model: Combines AI model technology with vast blockchain data resources to create a model that can effectively understand blockchain transactions, predict them, and interact with them. The basic version of the model, Theia, has been launched for public use.

These features make Chainbase stand out among indexing protocols, with a particular focus on real-time data accessibility, innovative data formats, and building smarter models by combining on-chain and off-chain data to enhance insights.

Chainbase's AI model Theia is the key feature that distinguishes it from other data service protocols. Theia is based on the DORA model developed by NVIDIA and integrates on-chain and off-chain data, along with temporal and spatial activity, to learn and analyze crypto patterns. Through causal reasoning, it responds to queries and deeply mines the latent value and patterns of on-chain data, providing users with more intelligent data services.

AI-powered data services make Chainbase not only a blockchain data service platform but also a competitive intelligent data service provider. Through powerful data resources and proactive AI analysis, Chainbase can deliver broader data insights and streamline users' data processing workflows.

3.3 Space and Time

Space and Time (SxT) is committed to creating a verifiable computing layer, scaling zero-knowledge proofs over a decentralized data warehouse to provide trusted data processing for smart contracts, large language models, and enterprises. It has raised $20 million in Series A funding, led by Framework Ventures, Lightspeed Faction, Arrington Capital, and Hivemind Capital.

In the field of data indexing and verification, Space and Time introduces an innovative technological approach: Proof of SQL. This is a zero-knowledge proof technology developed by SxT that ensures SQL queries executed on the decentralized data warehouse are tamper-proof and verifiable. When a query runs, Proof of SQL generates a cryptographic proof that attests to the integrity and accuracy of the query result. The proof is attached to the result, and any verifier (such as a smart contract) can independently confirm that the data processing has not been tampered with.

Traditional blockchain networks typically rely on consensus mechanisms to verify the authenticity of data, whereas Proof of SQL enables a more efficient verification method: in the SxT system, one node is responsible for data acquisition, while other nodes verify the authenticity of that data through zk technology. This avoids the resource cost of many nodes redundantly indexing the same data merely to reach consensus, enhancing the overall performance of the system. As the technology matures, it underscores the growing importance of data reliability.
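Proof of SQL itself relies on zero-knowledge cryptography, but the overall shape — one node computes a result and binds it to a commitment over the data, and verifiers check the claim rather than trust the prover — can be illustrated with a hash-commitment toy in Python. This is a didactic stand-in, not the real protocol: a true ZK verifier checks a succinct proof without re-running the query, whereas this toy re-derives the result to keep the example self-contained.

```python
import hashlib
import json

def commit(table: list) -> str:
    """Commit to a dataset. The real protocol uses cryptographic commitments
    over table columns; a plain hash stands in here."""
    return hashlib.sha256(json.dumps(table, sort_keys=True).encode()).hexdigest()

def prove_sum(table: list, column: str) -> dict:
    """The querying node returns the result bound to a commitment over the data.
    In Proof of SQL this binding would be a succinct zero-knowledge proof."""
    result = sum(row[column] for row in table)
    return {"result": result, "commitment": commit(table)}

def verify_sum(proof: dict, table: list, column: str) -> bool:
    """A verifier checks the claimed result against the committed data,
    rejecting any tampered result or dataset."""
    return (proof["commitment"] == commit(table)
            and proof["result"] == sum(row[column] for row in table))

rows = [{"value": 10}, {"value": 32}]
proof = prove_sum(rows, "value")
print(verify_sum(proof, rows, "value"))                     # True
print(verify_sum({**proof, "result": 99}, rows, "value"))   # False
```

The point of the pattern is the division of labor: one node does the expensive work, and everyone else pays only the (much smaller) cost of verification instead of redundantly re-indexing the data.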
