The preceding chapters analyzed different blockchain architectures, introduced key technologies such as cryptography and consensus algorithms in detail, offered development guidance for Bitcoin and Ethereum smart contracts, presented blockchain application cases, and discussed common questions about blockchain.
This chapter shares some thoughts on architecture, discusses how architectural change has driven each era of IT, and examines the evolution from Internet+ to Blockchain+, offering readers some personal observations and viewpoints.
Years of working as an architect have fostered a professional habit of paying attention to typical IT architectures. If we break down the technologies involved in blockchain, most of them, whether virtual currency, cryptography, consensus algorithms, or P2P communication, have traceable origins that predate Bitcoin. Blockchain has become so important mainly because it combines these technologies organically into an architecture that is decentralized, executes transactions automatically, and governs itself. Therefore, rather than calling blockchain a technological innovation, it is more accurate to call it an architectural innovation.
Most people tend to attribute social progress and changes of the times to the invention of a particular technology, often overlooking the fact that the key role is usually played by architectural innovation. Looking back at the history of IT development, every major transformation has invariably been due to the emergence of new architectures that created new capabilities and brought about new applications, which in turn drove change. The relationship between architecture and technology is somewhat akin to the relationship between a symphony conductor and various instrumentalists. Just as an outstanding conductor orchestrates different instrumentalists to perform a magnificent symphony, a good architecture integrates components that provide different technological functions through spatial and temporal arrangements, as well as communication and coordination of information, forming a complete computing system with certain functionalities.
From the perspective of computer development history, every technological innovation, such as mechanical technology, electromechanical technology, vacuum tube technology, semiconductor transistor technology, integrated circuit technology, and large-scale integrated circuit technology, has propelled the development of computers. However, what truly has epoch-making significance and plays a decisive role is the innovation of computational architecture. This is evident from the famous Turing machine model.
In 1937, British scientist Alan Turing published the famous paper "On Computable Numbers, with an Application to the Entscheidungsproblem." "Entscheidungsproblem" is the German term for the decision problem: a decision problem is called decidable if a program can be written that, for every element of the problem's domain, produces the correct answer for that instance. In the paper Turing proposed an abstract model of the computer, the Turing machine, consisting of a controller, an infinitely extendable tape, and a read/write head that moves left and right along the tape. This architecturally simple machine can, in principle, compute any intuitively computable function. Turing also proved, however, that no algorithm can solve the Entscheidungsproblem; in other words, some computational problems are unsolvable. Every computable algorithm can be executed by a Turing machine, and Turing's theory showed that it is possible to build a universal computer that can carry out any computation by being programmed.
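To make the model concrete, here is a minimal sketch of a Turing machine simulator in Python. The transition table, state names, and the bit-flipping example machine are invented for illustration and are not taken from Turing's paper.

```python
# A minimal Turing machine sketch: a finite controller (transition table),
# an unbounded tape, and a head that moves left or right one cell at a time.
from collections import defaultdict

def run_turing_machine(transitions, tape, start_state, halt_state, blank="_"):
    """transitions: {(state, symbol): (new_state, write_symbol, move)}"""
    cells = defaultdict(lambda: blank, enumerate(tape))  # tape as a sparse dict
    state, head = start_state, 0
    while state != halt_state:
        symbol = cells[head]
        state, write, move = transitions[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    used = sorted(cells)
    return "".join(cells[i] for i in range(used[0], used[-1] + 1)).strip(blank)

# Example controller (hypothetical): flip every bit, halt at the first blank.
flip = {
    ("scan", "0"): ("scan", "1", "R"),
    ("scan", "1"): ("scan", "0", "R"),
    ("scan", "_"): ("halt", "_", "R"),
}
print(run_turing_machine(flip, "10110", "scan", "halt"))  # -> 01001
```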
The significance of the Turing machine is that, through a simple architecture, it endows inorganic matter with a computational capability that previously belonged only to living beings. This is a groundbreaking, epoch-making event in human history. As a theoretical model of computation, the Turing machine laid the foundation for modern computer science.
In 1945, building on the Turing machine model, Hungarian scientist John von Neumann proposed the "stored-program" concept: program instructions and data are placed in the same memory, at different addresses within that memory, and share a single address bus and data bus, with instructions and data having the same width. To execute an instruction, the processor first fetches it from memory and decodes it, then fetches the operands and performs the computation. The program counter (PC) is a register inside the CPU that indicates where the next instruction or data item is stored. The CPU uses the address supplied by the program counter to locate the required instruction or data in memory, decodes the instruction, and executes the specified operation, proceeding through instructions in sequence. This design later became known as the "von Neumann architecture," also called the "Princeton architecture," as shown in Figure 11-1. It consists of input and output devices, a central processing unit (CPU), memory, and a bus connecting the CPU and memory.
Figure 11-1 Von Neumann Architecture
Because program instructions and data share the bus, computers based on the von Neumann architecture have limited parallelism and data-processing speed, suffering from the so-called "von Neumann bottleneck." The advantage of this architecture, however, is its simplicity: it does not require separate program and data memories, which greatly reduces the complexity of the surrounding hardware. For this reason it remains the mainstream architecture today, and the vast majority of computers are von Neumann machines. Typical examples include Intel's chips, ARM's ARM7, and MIPS processors, all of which use the von Neumann architecture.
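As a rough illustration of the stored-program idea, the following sketch models a toy machine with an invented instruction set (not any real ISA): the program and its data sit in one shared memory, and a program counter drives the fetch-decode-execute loop.

```python
# A toy stored-program machine: one memory holds both instructions and data,
# the pc selects the next instruction, and the CPU loops fetch -> decode -> execute.
def run(memory):
    pc, acc = 0, 0                       # program counter and accumulator
    while True:
        op, arg = memory[pc]             # fetch from the single shared memory
        pc += 1
        if op == "LOAD":                 # decode and execute
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "HALT":
            return memory

# Memory holds both the program (cells 0-3) and the data (cells 4-6).
memory = {
    0: ("LOAD", 4), 1: ("ADD", 5), 2: ("STORE", 6), 3: ("HALT", 0),
    4: 2, 5: 3, 6: 0,
}
print(run(memory)[6])   # -> 5
```

Because the program itself is just data in the same memory, nothing prevents an instruction from overwriting another instruction, which is exactly the flexibility (and the risk) discussed below.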
The earliest computing machines contained only fixed-purpose programs. Some modern computers still retain such a design, usually for simplicity or for educational purposes; a calculator, for example, contains only fixed mathematical routines and cannot be used as a word processor or to play games. To change such a machine's program, one must rewire it, change its structure, or even redesign it. Of course, the earliest computers were not designed to be reprogrammed in this sense; the so-called "rewriting of a program" probably meant working out the program steps on paper, settling the engineering details, and then modifying the machine's circuitry or structure.
The concept of the stored-program computer changed all of this. By defining an instruction set architecture and expressing a computation as a sequence of instructions, the machine became far more flexible. Because instructions are treated as a special kind of static data, a stored-program computer can easily change its program and, under program control, alter what it computes. The terms "von Neumann architecture" and "stored-program computer" are used interchangeably, and the discussion below does not distinguish between them. The Harvard architecture is a design that separates program storage from data storage, but it does not depart completely from the von Neumann architecture.
The stored-program design also allows programs to modify themselves while they run. One motivation for this facility was to let a program increment or otherwise modify the address portion of its own instructions, something operators had to do by hand in early designs. As index registers and indirect addressing became standard hardware mechanisms, this capability lost its importance, and self-modifying code has since been abandoned in modern programming practice because it makes programs hard to understand and debug; moreover, the pipelines and caches of modern processors make it inefficient.
Overall, treating instructions as data is what made assemblers, compilers, and other automated programming tools possible; these "programs that write programs" let people program in ways that are easier for humans to understand [1]. On a smaller scale, some I/O-intensive operations, such as using BitBlt to modify patterns on the screen, were once thought impossible without custom hardware; it was later shown that they could be achieved efficiently through "on-the-fly compilation" techniques.
This structure has its flaws, of course. Apart from the von Neumann bottleneck mentioned earlier, a program's ability to modify itself or other programs can be very damaging, whether done deliberately or through design error. In a simple stored-program computer, a badly designed program can harm itself, other programs, or even the operating system, causing the system to crash; buffer overflow is a typical example. The ability to create or modify other programs has also given rise to malware: by exploiting a buffer overflow, a malicious program can overwrite the call stack and rewrite code, or modify other program files to cause cascading damage. Memory-protection mechanisms and other forms of access control can guard against both accidental and malicious modification of code.
Harvard Architecture#
In contrast to the von Neumann architecture, the Harvard architecture is a computing architecture that separates program instruction storage from data storage, as shown in the accompanying figure.
The central processing unit first reads an instruction from the program memory, decodes it to obtain the data address, then reads the data from the corresponding data memory and performs the next step (usually execution). Because program storage and data storage are separate, instructions and data can have different widths, and the next instruction can be prefetched while the current one executes, so microprocessors based on the Harvard architecture typically achieve higher execution efficiency. Many processors and microcontrollers in use today adopt the Harvard architecture, including digital signal processors (DSPs), Motorola's MC68 series, Zilog's Z8 series, ATMEL's AVR series, and ARM's ARM9, ARM10, and ARM11. The strength of the Harvard architecture is its efficiency, but its complexity and its demands on peripheral connections and handling make it better suited to applications with few peripherals, such as embedded microcontrollers.
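For contrast, the sketch below reworks the earlier toy machine with the Harvard separation: instructions and data live in two independent memories, which is what would allow an instruction fetch and a data access to proceed on separate buses. The instruction set is again invented for illustration.

```python
# A toy Harvard-style machine: program memory and data memory are separate,
# so an operand access never competes with an instruction fetch for the same store.
def run_harvard(program, data):
    pc, acc = 0, 0
    while True:
        op, arg = program[pc]            # fetch from instruction memory
        pc += 1
        if op == "LOAD":
            acc = data[arg]              # operands come from data memory only
        elif op == "ADD":
            acc += data[arg]
        elif op == "STORE":
            data[arg] = acc
        elif op == "HALT":
            return data

program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)]
print(run_harvard(program, [2, 3, 0])[2])   # -> 5
```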
Modified Harvard Architecture#
A modified Harvard architecture machine resembles a pure Harvard machine but relaxes the strict separation between instructions and data while still letting the CPU access two (or more) memory buses concurrently. The most common modification is separate instruction and data caches backed by a common address space: while the CPU executes from cache it behaves as a pure Harvard machine, and when it accesses backing memory it behaves like a von Neumann machine (code can be moved around like data, which is a powerful technique). This modification is widespread in modern processors, such as the ARM architecture, Power Architecture, and x86 processors, and it is often still called "Harvard architecture," overlooking the fact that it has actually been "modified."
Another modification provides a pathway between instruction memory (such as ROM or flash memory) and the CPU, allowing words from instruction memory to be treated as read-only data. This technique is used in some microcontrollers, including Atmel's AVR. It allows access to constant data such as text strings or function tables without first copying them to data memory, thereby reserving scarce (and power-consuming) data memory for read/write variables. Special machine language instructions provide the capability to read data from instruction memory. (This is different from embedding constants within the instructions themselves, although for individual constants, the two mechanisms can be interchangeable.)
Characteristics of Influential Architectures#
For general-purpose CPUs, even though the Harvard architecture is efficient, its complexity and cost make it unsuitable for general-purpose scenarios; the x86 architecture's rapid capture of the market also owed much to its adoption of the simple, practical von Neumann architecture. Architectures that have had a major impact on the industry generally share the following characteristics:
(1) Simplicity
Whether it is the Turing machine model or the von Neumann architecture, the first impression is its simplicity, meaning that the components constituting the architecture are relatively few, and the collaboration between components is also relatively simple, making products based on this architecture simple and easy to use.
(2) Completeness
The components constituting the architecture are just right, indispensable, and can completely solve all problems in the target domain without needing to add any additional elements.
(3) Generality
Every architecture has its applicable scenarios, but an architecture aimed at overly specialized scenarios is unlikely to have broad impact, so generality is a necessary condition for becoming a mainstream architecture.
(4) Trade-off
Popular architectures find the best balance between efficiency and cost. Overemphasizing technological advancement while neglecting costs is often the reason many products lose their market.
(5) Compatibility
A good architecture must be open and compatible. In fact, the von Neumann and Harvard architectures are not absolute opposites; modern von Neumann chips have absorbed many features of the Harvard architecture, such as the instruction caches in CPUs, which clearly show its influence.
(6) Scalability
The most important indicator to judge whether an architecture can become popular and sustainable is its scalability. A good architecture achieves scalability through simple replication, arrangement, and combination, allowing the most basic system to expand linearly, resulting in linear growth in functionality and performance.
In 1950, Turing published the article "Computing Machinery and Intelligence," in which he first proposed a standard for judging whether a machine can think, the "Turing test." Roughly speaking, Turing envisaged a person and a computer, kept apart, conversing by typed messages on a screen; if after about five minutes of conversation the interrogator has no more than a 70% chance of correctly identifying which party is the machine (that is, the machine fools the interrogator at least 30% of the time), the computer passes the test and can be considered to possess human-like thinking ability. Consider the results of two famous human-versus-machine contests: on May 11, 1997, IBM's Deep Blue defeated the Russian world chess champion Garry Kasparov 3.5 to 2.5 (two wins, one loss, and three draws), and in March 2016 Google's AlphaGo beat the South Korean Go champion Lee Sedol by an overwhelming 4:1. Although neither was a strict Turing test, few would now doubt that computers can possess intelligence similar to that of humans. Computers still depend on programs written by humans, and many tasks that are easy for people, such as pattern recognition and sentiment analysis, remain challenging for them; nevertheless, their deep-learning and decision-making abilities are now beyond dispute.
It can be said that Turing foresaw, more than 60 years ago, that computers, as artificial inorganic entities, could gradually evolve, through arrangement and combination on top of the simple Turing machine architecture, into machines rivaling human intelligence. Today the development of big data, artificial intelligence (AI), and robotics has brought us into an era of inorganic intelligence. One important idea in Kevin Kelly's "Out of Control" is that artificial inorganic entities increasingly resemble living beings. The term "silicon-based intelligence" has emerged, and it may prove more powerful than "carbon-based intelligence," because carbon-based intelligence evolves far more slowly than silicon-based intelligence advances under Moore's Law. Many therefore believe it is inevitable, and will happen soon, that inorganic intelligence surpasses human intelligence. It is not surprising that the scientist Stephen Hawking, the entrepreneur Elon Musk, and Bill Gates have all voiced concerns about the development of artificial intelligence. At the International Joint Conference on Artificial Intelligence (IJCAI) held in Argentina in July 2015, more than a thousand scientists signed an open letter urging the United Nations to ban the development of autonomous weapons; the signatories included Hawking, Musk, and Apple co-founder Steve Wozniak, marking the first collective warning about the threat posed by artificial intelligence.
These are some of the insights gained from working as an architect. Below, I would like to share some observations and views on the development trend from Internet+ to Blockchain+.
The Turing machine model laid the theoretical foundation for modern computers, and the von Neumann architecture has dominated computer development for more than 60 years. During those 60 years, however, IT architecture at the macro level has changed enormously. Seen from the lowest layer of an IT system, these major changes can be summarized as upper-level architectural adjustments made to break through the underlying von Neumann bottleneck in new scenarios. How should this be understood? For a single computer, the efficiency limits of the von Neumann architecture are obvious: instructions execute serially, program and data share the same bus, and computation is separated from data. Problems that arise at the level of whole IT systems, including performance and scalability, therefore have to be solved by combining computers that are individually von Neumann machines, using specific protocols and a particular architecture, so that the system as a whole meets the functional and non-functional requirements of the business.
If we review the development history of IT architecture, we can roughly divide it into several eras: the era of traditional centralized mainframes, the client/server (CS) era dominated by PCs, the Internet era, the cloud computing and big data era, and now the Internet+ era characterized by deep integration of Internet technology and traditional industries. Below, we briefly introduce the situation of each era.
In the centralized mainframe era, the emphasis was on the capabilities of single machines and vertical scalability. To better improve resource utilization, various virtualization technologies at all levels, including CPU time-sharing systems, memory virtualization, and computing virtualization, emerged during the mainframe era.
In the 1960s, American computer scientist John McCarthy, who first proposed the concept of artificial intelligence, introduced the concept of utility computing, predicting that computing could one day be provided as a public service like water and electricity. This is the cloud computing concept we are familiar with today.
The centralized management of mainframes and the virtualization technologies that emerged from them gave birth to the earliest concepts of cloud computing. However, with the popularization of information technology, mainframes could no longer meet the increasingly diversified and widespread demands for information processing in terms of cost, openness, and scalability.
The Arrival of the Open Era#
In the 1970s, with the emergence of the UNIX operating system and the C programming language, the ideas of openness and interconnectivity began to take root, bringing a glimmer of hope to the development of IT architecture. Additionally, the demand for communication between mainframes and terminals, as well as communication between mainframes, spurred the development of networking technologies. During this process, Ethernet technology and TCP/IP technology were successively born. Particularly in 1981, the International Organization for Standardization (ISO) proposed the basic reference model for open systems interconnection (OSI/RM), which is the famous OSI seven-layer architecture, laying the foundation for transparent communication between different applications on different systems.
Client/Server (CS) Distributed Era#
By the 1980s, the demand for personal computing brought about the PC era. As PCs proliferated, the demand for local area networking grew steadily, and the development of local area networks gave rise to the client/server (CS) distributed architecture. The client/server architecture essentially redistributes part of the workload that a centralized host used to handle onto the clients, relieving the host and increasing the system's overall capacity for parallel processing. It is akin to replacing serial computation on one centralized mainframe with parallel computation on many small von Neumann computers, without altering the von Neumann architecture of any individual machine.
An important characteristic of distributed architecture is that computation and storage are spread across multiple nodes in the network, with software controlling task distribution and execution scheduling, so that tasks run in parallel on many nodes while upper-layer applications are given an interface that lets them access remote nodes as if they were local. Early remote procedure call (RPC), the Distributed Computing Environment (DCE), and the Common Object Request Broker Architecture (CORBA) were all distributed frameworks of this kind. They emphasized distributing computation, and the scope of that distribution was confined to the enterprise intranet.
The earliest distributed storage system was the File Access Listener (FAL), a distributed file system developed by DEC in the 1970s. Sun released NFS, the first widely used distributed file system, in 1985. Systems such as AFS, KFS, DFS, IBM's GPFS, and Sun's Lustre subsequently sprang up one after another.
The shift to distributed architecture not only saved cost but also improved efficiency, bringing computation closer to the user instead of requiring users to submit computing jobs at terminals connected to a mainframe, as in the past.
The Internet Era#
By the 1990s, the Internet began to spread. It broke through the limits of the local area network and allowed computation to transcend time and space. The CS architecture gradually gave way to the browser/server (BS) architecture, which unified the client into the browser so that applications could run on any platform. During this period the primary concerns of IT architecture were openness and portability: IT resources were expensive, and there was a strong desire to develop an application once and run it on any system. This idea gave rise to Java, a highly portable high-level programming language.
Entering the 21st century, as applications proliferated, IT architecture began to emphasize specialization and collaborative division of components. A well-known architectural principle called "Separation of Concerns" (SOC) emerged, meaning that different problems should be handled separately using different components. During this period, a three-tier architecture called "Model-View-Controller" (MVC) became popular, effectively embodying the SOC principle. In MVC, the Model is responsible for data entity operations, the View is responsible for presentation, and the Controller is responsible for control logic. This three-tier architecture can be applied at various levels, from simple web applications to large enterprise CRM or ERP systems, and has been widely adopted.
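As a small illustration of the MVC split described above, the following sketch uses three plain Python classes; the class and method names are invented for the example and are not tied to any particular framework.

```python
# Minimal MVC sketch: the Model owns data, the View only renders,
# and the Controller wires a request to the other two.
class NewsModel:                      # Model: owns the data entities
    def __init__(self):
        self._articles = {1: "Blockchain and architecture"}
    def get(self, article_id):
        return self._articles.get(article_id, "not found")

class NewsView:                       # View: renders whatever it is given
    def render(self, title):
        return f"<h1>{title}</h1>"

class NewsController:                 # Controller: control logic only
    def __init__(self, model, view):
        self.model, self.view = model, view
    def show(self, article_id):
        return self.view.render(self.model.get(article_id))

controller = NewsController(NewsModel(), NewsView())
print(controller.show(1))   # -> <h1>Blockchain and architecture</h1>
```

Because each concern lives in its own component, the storage backend or the presentation layer can be swapped out without touching the other two, which is the point of the SOC principle.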
Roy Fielding, one of the principal authors of the HTTP protocol, proposed the REST (Representational State Transfer) architectural style in his doctoral dissertation, completed in 2000, and it laid an architectural foundation for the Internet. The term "representational state transfer" is not easy to grasp at first glance. In essence, every resource in the world is given a unique identifier (a Uniform Resource Identifier, URI), and for each client request the server responds with a representation of the resource's state rather than the resource itself.
For example, when a user browses a news article on a web page, the request sends the URI pointing to the news resource to the server. The server locates the news resource by its URI and returns its "representational state," that is, the news content written in HTML or XML, to the client, while the news resource itself remains on the server.
In the REST architecture there is no contextual association between successive requests from the client to the server. In the example above, the form in which the news resource is stored internally has no necessary connection with the representation returned to the client; the server may, for instance, keep the news in a database. The user may next send a request to watch a video, a request completely unrelated to the previous one and carrying no context from it; on receiving it, the server sends the video resource's "representational state," the video stream, to the client. Users can also issue requests that modify a resource's representational state or delete the resource. The absence of context means that requests are stateless, so the server need not manage session state, which makes it easy to scale the infrastructure supporting the Internet, for example by adding web servers to increase request-handling capacity.
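The following sketch illustrates these REST ideas using Flask purely as an example framework (an assumption of this illustration, not something prescribed by the text): a news article is a resource identified by a URI, the server returns a JSON representation rather than the resource itself, and each request is handled statelessly.

```python
# A minimal REST sketch with Flask: the URI identifies the resource,
# the response carries only a representation of its state, and no
# session state is kept between requests.
from flask import Flask, jsonify, abort

app = Flask(__name__)
NEWS = {1: {"title": "Blockchain", "body": "An architectural innovation."}}

@app.route("/news/<int:news_id>")
def get_news(news_id):
    article = NEWS.get(news_id)
    if article is None:
        abort(404)                    # the resource does not exist
    return jsonify(article)           # a representation, not the resource itself

if __name__ == "__main__":
    app.run()
```

Because every request is self-contained, any number of identical web servers could sit behind a load balancer and answer it, which is exactly the scalability property described above.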
The REST architecture is the best fit for hypermedia browsing. Hypermedia uses hyperlinks to organize text and multimedia from different locations into a web-like medium, and web pages on the Internet are a form of hypermedia. With its unmatched scalability, REST ultimately became the mainstream architecture of the Internet; its rise can be attributed to its simplicity, compatibility, and scalability. The SOAP-based web services architecture promoted by IBM and Microsoft, by contrast, was gradually marginalized because of its excessive complexity.
Distributed systems are more complex than centralized systems because the issues they must address include consistency, availability, and partition tolerance. Consistency refers to the ability to read the most recently written data at every node at the same time. Availability means that a running node can always respond to requests within a reasonable time without errors or timeouts. Partition tolerance means that the system can continue to operate even when network partitions occur. These issues are not significant problems in centralized systems. However, in distributed systems, especially large-scale Internet distributed systems, they become the biggest challenges.
In 2000, Professor Eric Brewer proposed a conjecture stating that consistency, availability, and partition tolerance cannot all be satisfied simultaneously in distributed systems, and at most, only two of the three can be met. This conjecture was later proven and elevated to the well-known CAP Theorem. The CAP theorem provides insight for distributed system designers: the design of any distributed system must make trade-offs between consistency, availability, and partition tolerance based on different application scenarios, choosing any two of the three, as one cannot have it all.
Traditional enterprise IT applications mainly handle transactional data, such as accounting and inventory data, so consistency is a basic requirement, and most enterprise applications must satisfy the ACID properties. A stands for Atomicity: a transaction either completes entirely or not at all, never stopping in an intermediate state. C stands for Consistency: related data must be consistent both before and after a transaction. I stands for Isolation: different transactions must be fully independent and isolated from one another. D stands for Durability: once a transaction completes, its data must be persistently saved.
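Here is a small worked example of atomicity, the "A" in ACID, using Python's built-in sqlite3 module; the account names and amounts are hypothetical. A transfer between two accounts either commits both updates or rolls both back.

```python
# Atomicity sketch: the `with conn:` block commits on success and rolls the
# whole transaction back if an exception escapes, so no half-finished transfer
# is ever visible.
import sqlite3

def transfer(conn, amount, fail=False):
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        if fail:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, 60)                   # committed: alice 40, bob 60
try:
    transfer(conn, 10, fail=True)    # rolled back: balances unchanged
except RuntimeError:
    pass
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# -> {'alice': 40, 'bob': 60}
```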
In the Internet era, Internet applications mainly handle interactive data, such as shared web pages and images. Interactive data is far larger in volume than transactional data, and its consistency requirements are not high, but the demand it places on processing power is much greater than in traditional computing. The Internet environment also differs from the enterprise intranet: network problems are the norm rather than the exception, so Internet applications demand high availability and partition tolerance. Most Internet applications therefore aim to satisfy BASE. BA stands for Basically Available: the system remains basically usable even when nodes fail or the network is disrupted. S stands for Soft State: some nodes are allowed to hold temporarily inconsistent state at a given moment. E stands for Eventual Consistency: the system eventually becomes consistent through compensation or other error-correction mechanisms.
Compared to ACID applications, BASE applications have better scalability and are more suitable for running in distributed systems on the Internet. How to balance the three aspects of CAP in the Internet? How to support BASE applications? How to meet the computational power demands of processing the ever-growing massive data?
Cloud Computing and Big Data Era#
- Google's Architectural Transformation
Between 2003 and 2006, Google published three papers, on GFS, MapReduce, and BigTable, essentially revealing the platform architecture Google used internally to process massive search data. GFS is a large-scale distributed file system, MapReduce is a programming model within a parallel-processing framework, and BigTable is a non-relational database built on GFS and organized as key-value pairs. Because existing technologies, products, and platforms could not keep up with Google's rapidly growing business, Google innovated boldly around the characteristics of its search business: it broke with the constraints of traditional distributed file systems and developed a fault-tolerant distributed file system that scales to very large sizes, then built a parallel computing platform and a distributed database on top of it, enabling its search platform to handle explosively growing volumes of data on an unprecedented scale.
In particular, MapReduce's parallel-programming framework splits the data automatically in software, assigns tasks to different nodes, and handles scheduling, load balancing, and monitoring automatically, while repairing errors and managing communication between nodes on its own. Traditional parallel applications required developers to master skills such as MPI programming and were largely confined to high-performance computing. The MapReduce framework simplified the programming of parallel systems and sharply lowered the barrier to building them. It can be said that MapReduce allowed computers based on the von Neumann architecture, which individually lack parallel-computing capability, to take on a new life in the Internet era through cluster-level parallel processing.
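The following word-count sketch conveys the flavor of the MapReduce programming model: the user supplies only a map function and a reduce function, while the framework, crudely simulated here in a few lines, takes care of splitting, shuffling, and aggregation. It is a toy illustration, not Google's implementation.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the "shuffle"
# groups pairs by key, and reduce sums each group.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

def map_reduce(documents):
    groups = defaultdict(list)                     # shuffle: group by key
    for key, value in chain.from_iterable(map_phase(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(map_reduce(["big data big", "data platform"]))
# -> {'big': 2, 'data': 2, 'platform': 1}
```

In a real cluster the map calls and the reduce calls run on different nodes in parallel, and the framework, not the developer, handles partitioning, retries, and data movement.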
Google's three papers laid the architectural foundation for large-scale distributed systems on the Internet and opened the curtain on the era of big data. Google's contribution grew out of its own business needs: after weighing the strengths and weaknesses of traditional distributed architectures, it proposed an entirely new architecture for distributed storage, distributed parallel computing, and distributed databases. Its hallmark, however, is still a centrally managed model for a scalable distributed system.
- Amazon's Architectural Transformation
Google was the first company to propose the concept of cloud computing, and another pioneer of the cloud computing business model, Amazon, also made strides by publishing the Dynamo distributed database paper in 2007. Similar to Google, Amazon innovated based on its business characteristics, treating system errors as a norm; however, unlike Google, Amazon adopted a decentralized, fully distributed architecture.
Amazon's Dynamo paper disclosed the design and implementation details of Dynamo, a distributed key-value database. Dynamo was designed mainly for large-scale e-commerce scenarios such as the shopping cart, which must be "always on" and allow users to make changes at any time, providing a highly available customer experience. Its design goal puts availability first, sacrificing consistency in certain situations, and the paper explicitly introduces the notion of "eventual consistency." Its design philosophy draws on peer-to-peer architecture, making the whole distributed system decentralized. Dynamo combines several well-known techniques to achieve scalability and availability: data is partitioned and replicated using consistent hashing, and consistency is aided by object versioning; during updates, consistency among replicas is maintained by a quorum-like technique and a decentralized replica synchronization protocol. Dynamo involves three important parameters: N is the number of replicas of each data item, W is the minimum number of nodes that must acknowledge a write for it to succeed, and R is the minimum number of nodes that must respond to a read for it to succeed, with the requirement that W+R>N. When reading, once R-1 nodes besides the coordinator return data, the read is considered successful (and may return multiple versions of the data); likewise, a write is considered successful once W-1 nodes besides the coordinator have written it. Dynamo uses a gossip-based protocol for distributed failure detection and membership management. It needs very little manual administration: storage nodes can be added or removed without any manual partitioning or redistribution of data. Dynamo has long been the underlying storage for core services of Amazon's e-commerce platform, scaling to extreme peak loads without downtime during busy holiday shopping seasons.
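The sketch below illustrates only the quorum rule W+R>N described above, with invented replica data; real Dynamo layers consistent hashing, vector-clock versioning, and hinted handoff on top of this rule.

```python
# Quorum sketch: W + R > N guarantees that any read quorum overlaps the
# most recent write quorum, so at least one replica in the read set is current.
def quorum_overlaps(n, w, r):
    return w + r > n

def read_latest(replicas, r):
    """Return the newest value among the first r replicas that answer."""
    answered = replicas[:r]                       # pretend these r replicas responded
    return max(answered, key=lambda rec: rec["version"])["value"]

N, W, R = 3, 2, 2
assert quorum_overlaps(N, W, R)

replicas = [                                      # per-replica (version, value) records
    {"version": 2, "value": "cart: book, phone"},
    {"version": 1, "value": "cart: book"},        # a stale replica
    {"version": 2, "value": "cart: book, phone"},
]
print(read_latest(replicas, R))   # -> cart: book, phone
```

Tuning W and R trades latency against consistency: a write-heavy service might pick W=1, R=N, while a read-heavy one does the opposite, as long as the overlap condition holds.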
Dynamo and BigTable are both non-relational databases, commonly referred to as NoSQL databases. However, their design philosophies differ significantly. Dynamo is entirely decentralized, assuming deployment in a trusted internal network without security measures. In contrast, BigTable is centrally managed, utilizing access control to provide security measures. Dynamo's data model is a key-value model, while BigTable is a multidimensional sorted map. Dynamo employs consistent hashing for distributed metadata management, while BigTable uses centralized metadata management. Their applicable scenarios also differ. Dynamo primarily targets e-commerce shopping cart applications, emphasizing high availability with lower consistency requirements, making it a typical AP database. BigTable, on the other hand, has higher requirements for consistency and scalability, making it more suitable for handling structured data, thus being a typical CP database.
- Characteristics of Cloud Computing Architecture
The cloud computing giants Google and Amazon have ushered in the era of cloud computing and big data. However, the concepts of cloud computing and big data quickly became buzzwords exploited by vendors, leaving many customers confused.
The U.S. National Institute of Standards and Technology (NIST) provided a relatively objective and time-tested definition that can clarify many misunderstandings. NIST defines cloud computing as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (including networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
NIST further summarized the concept of cloud computing with "three-four-five" key points: "three" refers to the three service models (Infrastructure as a Service, IaaS; Platform as a Service, PaaS; Software as a Service, SaaS); "four" refers to the four deployment models (private cloud, public cloud, hybrid cloud, and community cloud); and "five" refers to the five characteristics of cloud computing (on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service).
We noted earlier that an architecture is more than a simple combination of technologies; a large part of it is weighing efficiency against cost. In other words, among the many considerations in architecture, economic factors are crucial: an architecture without a reasonable economic model is unlikely to become mainstream. The comparison between the von Neumann and Harvard architectures showed this, and the same is true of the characteristics of cloud computing architecture.
NIST's definition of cloud computing is still too formal and difficult to understand. To explain in simpler terms, cloud computing is essentially a service model that uses fault-tolerant, parallel scheduling software to pool large-scale, inexpensive standard industrial servers, transforming the IT capabilities of the resource pool (including computing power, network capabilities, storage capabilities, and application capabilities) into services provided in an elastic, on-demand manner. There are several noteworthy aspects:
- Cloud computing integrates centralized management capabilities on top of a distributed architecture, and so exhibits centralized architectural characteristics. In the cloud computing era, most business logic and data processing is concentrated on cloud platforms running in large data centers, while mobile devices mainly handle presentation. The browser/server (BS) architecture of the Internet era is gradually giving way to a client/cloud (CC) architecture.
- Cloud computing architecture is a trade-off between cost and efficiency. Using inexpensive standard industrial servers instead of costly branded equipment reduces hardware cost, software fault tolerance compensates for the quality of the cheap hardware, and parallelism and virtualization raise resource utilization.
- Cloud computing is a convenient service model that delivers computing, network, and storage resources, and the IT capabilities built on top of them, as services. This model gives cloud computing its distinctive architectural characteristics: service orientation, resource pooling, software definition, commodity hardware, measured service, horizontal scaling, and elasticity. Of these, software definition and standardized commodity hardware have had the greatest impact on traditional IT. A key feature of software-defined networking, software-defined storage, and software-defined security is the separation of the control plane from the data plane, decoupling control software from hardware; this greatly improves openness, scalability, and flexibility and makes management easier. The other driver is economic: costs fall sharply, and traditional networking, storage, and security vendors lose the protective barrier of specialized hardware. Initiatives such as the Open Compute Project (OCP) and Project Scorpio shift power from traditional vendors to end users, and it is not hard to imagine that brand-centric vendors such as Cisco and EMC will face a serious challenge from "white box" manufacturers.
The Roots of Big Data#
The concept of Big Data was first proposed in a report titled "Big Data: The Next Frontier for Innovation, Competition, and Productivity" published by McKinsey & Company in May 2011. The definition of big data provided in the report is: Big data refers to datasets that exceed the conventional capabilities of database tools for acquisition, storage, management, and analysis.
Once the concept of big data was introduced, it quickly dominated media headlines, leading to a situation where various industries frequently mentioned big data. Many traditional BI and data warehouse solutions were repackaged as big data solutions, and even many projects unrelated to big data were marketed as such. Big data quickly became a buzzword.
The International Data Corporation (IDC) defines big data based on four characteristics: massive data scale (Volume), rapid data flow and dynamic data systems (Velocity), diverse data types (Variety), and enormous data value (Value). Amazon's big data scientist John Rauser provided a simple definition: big data is any volume of data that exceeds the processing capacity of a single computer.
In fact, the exact definition of big data may not matter much, since different perspectives yield different definitions. What is undeniable is the exponential growth of data volume, which confronts humanity with unprecedented challenges, both in computational power and in the energy needed to sustain that computation, because computation consumes energy. Data keeps doubling in ever shorter intervals, while energy supply cannot keep pace; the likely result is that much of the data generated is simply wasted. According to Cisco's forecast, global data center IP traffic will reach 10.4 ZB per year by 2019, an average of 863 EB per month, roughly three times the 2014 figure of 3.4 ZB per year (287 EB per month). This rate of growth will directly drive innovation and transformation in IT architecture.
So why did the challenge of big data suddenly appear in the early 21st century? As the saying goes, "three feet of ice does not form in a single cold day": the big data phenomenon is the result of decades of informatization and digitization in human society. Twenty or thirty years ago most communication signals were analog; after the digital revolution, most signals moved from analog to digital, and the development of IT accelerated the digitization process. Early computers merely digitized business and management information; the Internet and the mobile Internet digitized human communication; and in the era of the Internet of Things, the data generated by interactions between things will surpass everything that came before.
The emergence of big data demands greater computational processing capabilities, thereby driving the development of IT architecture. Google's three papers were produced against this backdrop.
Today most people see the opportunity in big data: through big data analysis they hope to grasp customer needs more accurately, understand market changes better, and respond to business changes faster through data-driven decisions. If big data processing depends on the support of cloud computing, big data analysis in turn must be combined with industry knowledge to build industry-specific analytical models. This deep integration with industry has, in turn, ushered in the Internet+ era.
The Internet+ Era#
The Internet+ era is essentially an extension of the cloud computing and big data era, representing a stage of deep integration of cloud computing and big data technologies with industries, fundamentally signifying the digital transformation of enterprises.
- Internet+—Digital Transformation of Enterprises
IDC believes that the IT industry is entering a third platform era represented by cloud computing, mobile interconnectivity, big data, and social media. IBM has also proposed a comprehensive transformation towards CAMSS (Cloud, Analytics, Mobile, Social, Security). In China, "Internet+" became the most widely recognized term in 2015. From national leaders to ordinary citizens, everyone is discussing "Internet+." An important characteristic of "Internet+" is achieving the digital transformation of traditional enterprises, with cloud computing serving as the architectural and platform foundation for this transformation.
There are many debates regarding the connotation and extension of "Internet+." A relatively convergent view is that "Internet+" is an important stage in the transition from consumer Internet to industrial Internet, characterized by traditional industries adopting "Internet thinking" to innovate business models, utilizing Internet and big data technologies to provide better value services and products through close integration of online and offline channels. This has led to various models combining Internet technologies with industries, such as Internet+ finance, Internet+ manufacturing, Internet+ education, Internet+ transportation, Internet+ energy, and so on.
- Internet+ Finance—Fintech
Among these, Internet+ finance has attracted the most attention. Internationally, the concept corresponding to Internet+ finance is called Fintech (financial technology). Fintech originally referred to the IT used in the back offices of large financial institutions, including accounting systems, trading platforms, and payment, settlement, and clearing technologies. In the Internet era, the concept has expanded to cover the IT that supports business innovation across the financial industry, including P2P lending, crowdfunding, mobile payments, virtual currencies, and big data analysis of customer behavior.
Typical representatives include P2P lending platforms such as Prosper and LendingClub; mobile payment systems such as Google Wallet, Apple Pay, Alibaba's Alipay, and Tencent's WeChat Pay; personal finance management platforms such as Mint; robo-advisors such as LearnVest; and Bitcoin.
- Blockchain—The Crown Jewel of Fintech
Earlier we traced the history of IT architecture, which has evolved from centralized mainframes to CS distributed systems and on to the centralized systems of cloud computing. As the saying goes, "what is long divided must unite, and what is long united must divide." History does not simply repeat itself; it develops along a spiral. Google's three papers all describe managing distributed computing through a centralized architecture, whose advantage is unified management of metadata and scheduling while guaranteeing consistency, whereas Amazon's Dynamo architecture is clearly decentralized. One significant problem with centralized architectures is that the managing nodes become performance bottlenecks and easy targets for attack. Another crucial problem is the high cost of establishing and maintaining trust in a large distributed environment.
Centralized architectures face another critical issue: if the people who control the central nodes make subjective errors, act dishonestly, or lose their independence under pressure from third parties, the consequences for the entire network can be catastrophic. The inventor of Bitcoin, Satoshi Nakamoto, left a message in the first Bitcoin block mined on January 3, 2009: "Chancellor on brink of second bailout for banks." This was the headline of The Times in the UK that day, reflecting the deepening global financial crisis. According to the authors of "Bitcoin and Cryptocurrency Technologies," published by Princeton, Nakamoto built a completely decentralized virtual currency system out of dissatisfaction with the centralized banking system's over-issuance of currency and unrestrained credit expansion. Nakamoto open-sourced the Bitcoin system from the start; it is controlled by no one, and its total issuance is capped at 21 million bitcoins, released gradually according to fixed rules. Like gold, therefore, Bitcoin has a degree of scarcity and is a non-inflationary virtual currency.
Since its launch in 2009, Bitcoin has run continuously for more than seven years. The Bitcoin experiment demonstrates that a completely decentralized distributed architecture can establish trust among strangers through a suitable economic model (mining incentives) and consensus algorithm, circumventing the fatal weakness of centrally managed distributed architectures at their central nodes. The blockchain architecture underlying Bitcoin also addresses something the Internet itself cannot: the high cost of establishing and maintaining trust. In addition, blockchain uses cryptographic signatures and hash algorithms to solve counterfeiting problems that are hard to address on the Internet. A less-noticed but distinctive feature is that computation on a blockchain requires "fuel" (gas) or transaction fees, so computation is tied to the cost that supports it. This is a major difference from traditional IT architectures, which contain no financial element; their hidden danger is that computation can be attacked with computation, which is why distributed denial-of-service (DDoS) attacks can never be completely eradicated on the Internet. On a blockchain, by contrast, the possibility of DDoS attacks is greatly reduced, because mounting one would require a large reserve of virtual currency, putting the attacker at a serious disadvantage in both cost and the concealment of the attack's source. Blockchain is therefore an IT architecture that is naturally and closely integrated with finance.
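The following sketch illustrates just the hash-linking idea mentioned above: each block commits to the hash of the previous block, so tampering with any historical record breaks every later link. The field names are illustrative and do not reflect Bitcoin's actual block format.

```python
# Hash-chain sketch: block n stores the hash of block n-1, so changing any
# earlier block invalidates the stored link in every block after it.
import hashlib, json

def block_hash(block):
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "data": data})
    return chain

chain = []
append_block(chain, "genesis: Chancellor on brink of second bailout for banks")
append_block(chain, "alice pays bob 1 coin")

# Tampering with the first block breaks the link stored in the second.
chain[0]["data"] = "alice pays mallory 100 coins"
print(chain[1]["prev_hash"] == block_hash(chain[0]))   # -> False
```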
More importantly, by combining scripting engines, cryptography, and virtual currency mechanisms, blockchain can achieve payment, automatic settlement, and clearing. Thus, blockchain has been referred to as the "Internet of Value" by Alex Tapscott, CEO of Northwest Passage Ventures.
Therefore, the significance of blockchain is self-evident. Especially for the financial industry, credit risk is an inescapable nightmare in traditional finance. However, blockchain offers an almost perfect solution to address credit risk. Consequently, blockchain technology is regarded as a disruptive technology for the next generation of the Internet. The Wall Street Journal even declared that blockchain is the most significant breakthrough in the financial sector in the past 500 years. Thus, blockchain can be said to be the crown jewel of the Fintech field.
Judging from this history, we can boldly hypothesize that the solution to the challenges described above will not be any single technology but a new architecture that integrates multiple technologies. We can further hypothesize that in this new architecture, quantum computing will address computational capability, neural computing will address intelligent cognition, and, crucially, blockchain will address the behavioral norms and self-governance of computers and robots.
With this in mind, the future is still full of opportunities, and we are filled with anticipation and hope for what lies ahead.