1. What is AlephBFT?
AlephBFT is a Rust implementation of the Aleph Consensus Protocol that offers a convenient API allowing it to be easily applied to various problems. The prime application of AlephBFT is to be the consensus engine (sometimes called the "finality gadget") of the Aleph Zero blockchain.
1.1 High level idea of what AlephBFT does.
From a high-level perspective, AlephBFT allows a set of N prespecified nodes to agree on an ordering of items arriving in some unreliable streams of data that these nodes locally observe. An illustrative example to have in mind here is when these nodes all observe a growing blockchain that does not have a built-in notion of finality and would like to finalize blocks. Then the above-mentioned "streams of data" are simply the advancing sequences of blockchain "tips" that each of the nodes sees. Note that in this (and in all the interesting examples), because of possible forks or network delays, the data streams of individual nodes might not be consistent.
The goal of a consensus protocol is to ensure consistency of the decisions even when relying on an unreliable data source. Consequently, what AlephBFT produces is a single stream of data that "combines" all the individual streams of the N nodes and, importantly, is consistent among the nodes. Thus, in the example above, all the nodes would produce a unique sequence of finalized blocks (see also the corresponding AlephBFT API section for a more detailed description of how to use AlephBFT as a finality gadget for a blockchain).
1.2 High level idea of AlephBFT requirements.
Let us index the nodes taking part in the protocol as i = 0, 1, ..., N-1 and call them "the committee". Below we list some high-level requirements for running AlephBFT among these nodes:
- The nodes are connected via a network (that is not assumed to be 100% reliable) allowing them to send arbitrary messages to each other.
- Each node knows the identities of all other nodes (via their public keys) and holds a private key allowing it to sign messages.
- Each node i holds a data source object (to be explained in detail in the DataIO subsection of the AlephBFT API section -- see DataIO) that allows it 1) to receive fresh pieces of data (we refer to this as the input stream in_i), and 2) to check that a piece of data received from another node is "available". The availability is best understood when thinking about the blockchain example with data being block hashes. Then the availability question for a block hash is essentially whether we locally hold a block with such a hash.
- At most floor((N-1)/3) of the nodes can be malicious (act with the intent to break the protocol guarantees).
1.3 High level idea of AlephBFT guarantees.
AlephBFT guarantees (as long as at most floor((N-1)/3) nodes act maliciously) that each node i running the protocol produces a stream of data out_i such that:
- The output streams of any two nodes out_i and out_j are consistent, in the sense that at any given time one is a prefix of the other. Moreover, none of the streams gets "stuck"; they keep producing items until the protocol is shut down. So, intuitively, all the streams produce the same items, just at possibly different paces.
- Each item in the output stream is "available" to at least half of the honest (= not malicious) nodes in the committee.
- Roughly speaking, most of the data items "proposed" by honest nodes (i.e., data coming from the in_i streams of honest nodes) eventually land in the output stream.
2 How AlephBFT Does It?
AlephBFT implements the Aleph Consensus Protocol -- we highly encourage you to read the original paper, as it offers the best explanation of the ideas and does not require much background to understand. In this documentation we try to offer a self-contained sketch of the protocol, although, because of its compactness, it might be harder to understand. For those who have read the paper, we would like to note that there are certain minor differences between the paper and the actual implementation, which we highlight in a dedicated section. However, these do not affect any of the fundamental ideas behind AlephBFT or its security guarantees and are mostly tricks and implementation details.
Note: for clarity, given N we define f to be the maximum number of malicious nodes allowed in AlephBFT: f = floor((N-1)/3). For instance, for N equal to (1, 2, 3, 4, 5, 6, 7, 8, 9, 10), the corresponding values of f are (0, 0, 0, 1, 1, 1, 2, 2, 2, 3). Then the number of honest nodes is guaranteed to be at least N - f = floor(2N/3) + 1, which is strictly more than 2f.
2.1 Dag -- the main idea of AlephBFT.
Recall that in the model we are interested in, the nodes keep receiving new data items from their input streams in_i and would like to produce a single output stream out (i.e., the unique stream consistent with their local output streams out_i). The main idea of AlephBFT is for the nodes to produce a DAG (Directed Acyclic Graph) of vertices that we call Units, where each such unit is created by one of the nodes i and carries one piece of data from the input stream in_i. Whenever a unit is created by a node i, it is then dispersed among all other nodes, with the intention that each node at any given time has a sufficiently up-to-date version of the Dag. AlephBFT then specifies a combinatorial algorithm that, given a Dag D, allows one to compute (statically, without any need for communication) a certain prefix of the output stream out, with the property that as the Dag D is updated with new vertices to form D', the new prefix computed from D' either equals or extends the one obtained from D. In other words, by growing the Dag, we cannot "roll back" the output sequence.
2.2 Constructing the Dag -- Units.
2.2.1 Unit Structure.
The structure of the Dag in the AlephBFT implementation is especially simple (with some simplifications w.r.t. the Aleph paper). Each unit U (a vertex in the Dag) holds the following fields:
- The creator of U is simply a node index in [0, N-1] specifying who created this unit.
- The data in U is an element of some Data type -- this is supposed to be the most recent data item U.creator received from its input stream prior to the creation of U.
- The round of U is a non-negative integer specifying the sequential number of the unit among the ones created by U.creator. It is convenient to think that round can be arbitrarily large, but for technical reasons it is bounded by a fixed constant MAX_ROUND.
- The parent_map of U is simply a bitarray of length N that specifies some subset of the nodes {0, 1, ..., N-1}. The meaning of this set is that if i belongs to parent_map, then U has an edge to a unit V created by i in round U.round - 1.
- The control_hash of U (or rather of the parents of U) is a fingerprint of the list of actual units being the parents of U. While the AlephBFT protocol instructs honest nodes to create exactly one unit per round, this behavior cannot be enforced among malicious nodes. This is the reason why the parent_map array by itself does not necessarily uniquely specify the parents of U, and we thus attach a control_hash = Hash(U_1, U_2, ..., U_k), where U_1, ..., U_k are the parents of U.
- The signature is a signature under the unit (as a collection of all the above fields) generated by U.creator. The signature is proof that the given node indeed created such a unit; all units must be sent along with their signatures.

By parents(U) we denote the set of all units that U has edges to (these are always units of round one less than U's round). By Hash(U) we mean the hash of a serialization of the fields of U, not including the signature. Since Hash(U) is smaller than U and still uniquely specifies U, we sometimes use hashes in the protocol instead of units.
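To make the field list above more concrete, here is a minimal Rust sketch of such a unit; the type aliases are placeholders chosen for illustration and do not match the exact types used in the crate.

// Illustrative type aliases, not the crate's actual types.
type NodeIndex = usize;
type Round = u32;
type Hash = [u8; 32];
type Signature = Vec<u8>;

struct Unit<Data> {
    creator: NodeIndex,    // index in [0, N-1] of the node that created this unit
    data: Data,            // the data item taken from the creator's input stream
    round: Round,          // sequential number among the creator's units, bounded by MAX_ROUND
    parent_map: Vec<bool>, // bitarray of length N: which nodes' round-1 units are parents
    control_hash: Hash,    // fingerprint of the actual list of parent units
    signature: Signature,  // creator's signature over all of the fields above
}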
2.2.2 Unit Creation Rules.
Below we provide pseudocode for creating units that the honest nodes are expected to follow. In what follows, D is the local copy of the Dag, i.e., the set of units that a particular node holds. We assume that a node can add a unit U to D only if parents(U) are all in D (this requirement of course creates some challenges in the implementation, but we disregard them here).
/* Written from the perspective of node k. D is a global variable denoting the set of units held by k. */

def can_create(r):
    if r == 0:
        return true
    if D does not contain a unit created by k in round r-1:
        return false
    if D contains less than N-f units from different creators in round r-1:
        return false
    return true

def creator_task():
    for r = 0, 1, 2, ...:
        loop:
            // since D is growing, the below if will eventually fire
            if can_create(r):
                Create a unit U of round r:
                    Pick as its parents at least N-f units (from different creators) of round r-1, including node k's previous unit.
                    Poll the input stream for a fresh data item `d` and place it in `U` in the `data` field.
                Send U to all other nodes
                sleep(CREATE_DELAY)
                break the inner loop
The creator_task keeps creating units of growing rounds, with the requirement that, roughly, there are at least (2/3)*N units of the previous round available in D. There is a certain delay after each unit is created -- this technicality allows us to make sure the Dag does not grow too quickly, which would be problematic for pragmatic reasons: the nodes have limited bandwidth (which also varies between different nodes) and bounded RAM, in which we would like to store the Dag.
2.2.3 Malicious Units: Forks (Equivocations).
Note that from the pseudocode of the creator_task it follows that each node is supposed to create exactly one unit per round, so ideally a Dag of "height" R should have R*N units in total. In the version of the AlephBFT protocol described in the main body of the paper, this rule (one unit per node, per round) is enforced via a special broadcast protocol for disseminating units among other nodes: Reliable Broadcast. Roughly speaking, it guarantees that there will always be at most one unit per creator, per round. This is extremely convenient for making the protocol simple and clean, but unfortunately implementing the protocol this way (with Reliable Broadcast, which is quite heavy) is rather inefficient. For this reason AlephBFT implements a version of the protocol (referred to as QuickAleph in the paper) that "allows" malicious nodes to create multiple versions of a unit for a given round (we call these forks). This makes the protocol slightly more complex, but much more efficient in practice. One can think of QuickAleph as an "optimistic" version of Aleph that assumes that all nodes typically behave honestly and fork only in extremely rare cases, at which point the protocol, in a sense, falls back to Reliable Broadcast in order to alert all the nodes that a forker has been detected (we give a few more details on that in the section on Reliable Broadcast).
2.2.4 Disseminating Units.
From now on we assume that whenever a unit U lands in the Dag D of an honest node k, all other honest nodes will eventually (maybe after some delay) receive U and place it in their copies of the Dag. To achieve this in practice, AlephBFT employs several mechanisms to guarantee robustness of the process of disseminating units:
- Firstly, the creator broadcasts the unit.
- Secondly, all nodes periodically broadcast the top known units for all other nodes. This only happens if a node hasn't produced a unit for some time, because otherwise we can assume that the other nodes received the newest unit via the regular broadcast.
- Thirdly, there is a request-response mechanism that allows nodes to fetch missing units from other nodes.
2.3 Computing the Ordering from the Dag.
So far (in Section 2.2) we learned that the main object in AlephBFT is the ever-growing Dag composed of units created by different nodes. We ensure (and thus assume from now on) that all the honest nodes, at every moment in time, hold a reasonably up-to-date version of the Dag. We now proceed to describe an algorithm OrderData which, given a Dag D (a static object), computes a sequence (for instance a Vec<Data> in Rust) of Data items (coming from some units in the Dag) with the following fundamental monotonicity property:

If D is a subset of D', then OrderData(D) is a prefix of OrderData(D').
Let us pause for a minute to realize how strong this property is, and that it already solves the consensus problem. Let us denote by D_1, D_2, ..., D_N the local Dags of nodes 1, 2, ..., N, and by order_1, order_2, ..., order_N the results of applying the OrderData algorithm to these Dags. Further, let D be the Dag being the union of all the D_i's for i = 1, 2, ..., N. From the property above it follows that if order = OrderData(D), then each order_i is a prefix of order, and hence consequently the order_i's also agree with each other! Given such an algorithm OrderData, we can complete the description of the AlephBFT protocol: each node simply keeps its local Dag D up-to-date and recomputes OrderData(D) over and over; the ever-growing sequence of Data items obtained from this procedure is this node's output stream.
Note: The description of OrderData provided below is nowhere near what happens in AlephBFT's code. This description is optimized for ease of understanding the main concept; an efficient implementation of the idea is a different matter entirely. In fact, the name OrderData is not even used in the code -- the ordered data instead lands in an output stream that is produced asynchronously.
Before we dive into the details and the pseudocode, let us first sketch an overall picture of what is going to happen in OrderData. First of all, there is a procedure for choosing a Head for each round in the Dag -- it essentially picks a single unit out of each round. The heads are then used to order all the units in the Dag: simply go over all the rounds r = 0, 1, 2, ... and place the head head_r for this round, along with all the units "below" the head that have not been ordered yet, in a new batch. To choose heads we use virtual voting: the idea is that for each unit U in the Dag, the units with round >= U.round cast virtual "votes" (computed deterministically from the Dag structure) on whether U should be chosen as head or not. The voting rules are crafted in such a way that at a certain round (typically round 3 or 4) above U, all units' votes are already unanimous: either all vote "yes" or all vote "no". This determines the "decision" for U. The head for a given round is then chosen to be the first unit in that round (according to some deterministic ordering) that was decided "yes". We strongly encourage the reader to inspect the paper for more intuition.
2.3.1 Description of OrderData.
For a unit U in a certain Dag, let us denote by below(U) the set of all units that can be reached from U by following 0 or more forward edges from U; this set consists of U, the parents of U, the parents of parents of U, and so on. Below we describe OrderData(D) based on a procedure Head(r, D) that, given a round r and a Dag D, outputs an element of type Option<Unit>, thus either None or Some(U), where U is some unit in D of round r. The description of Head(r, D) follows later.
def OrderData(D):
    order = []
    batch = []
    for r = 0, 1, 2, ... :
        let head = Head(r, D)
        if head is None:
            break
        let Some(U) = head
        new_batch = below(U)
        remove from new_batch the contents of batch[0], batch[1], ..., batch[r-1]
        sort new_batch using any canonical ordering
        batch[r] = new_batch
        let [U_1, U_2, ..., U_k] = new_batch
        order = order ++ [U_1.data, U_2.data, ..., U_k.data]
    return order
The OrderData algorithm is indeed quite simple: we choose a Head unit from each consecutive round (until it is no longer possible) and generate a new batch of units by taking all units below the head and removing the units from previous batches. The ordering is simply the data items from the units in subsequent batches, where each batch is ordered arbitrarily (yet using some deterministic ordering). It is not hard to observe that OrderData has the desired monotonicity property as long as the black-box Head(r, D) is also "monotone", in the sense that:

If D is a subset of D' and Head(r, D) = Some(U), then Head(r, D') = Some(U).
We proceed to describe Head(r, D).
2.3.2 Description of Head.
We will again describe Head as a procedure using another black-box: Decide(U, D), which takes a unit U and a Dag D and outputs a value of type Option<bool>, so either a boolean "decision" Some(b) or None. We assume here that Decide(U, D) satisfies a property analogous to the one mentioned for Head above.
def Head(r, D):
    if the highest round of a unit in D is < (r+3):
        return None
    let [U_1, U_2, ..., U_s] be the units in D of round r sorted according to some canonical order
    // The order is allowed to depend on r and on Head(r-1, D)
    for l = 1, 2, ..., s:
        if Decide(U_l, D) == Some(b):
            if b == true:
                return Some(U_l)
            else:
                continue the loop
        else:
            return None
    // The execution can never reach this line because at least one unit in round r must be decided true.
To compute the head for a given round r, we test all the units of this round in the Dag D in some arbitrary canonical order (ordering by unit hashes is fine, but more sophisticated orderings here might give additional defense against delays caused by malicious nodes). The first unit in this list that is decided true becomes the head; however, if Decide() outputs None at any point, Head must output None as well.
2.3.3 Description of Decide.
Finally, we describe the Decide procedure, which is the last remaining piece of a working consensus protocol. Again, Decide is based on a black-box DecideVia(U, V, D), which we describe subsequently:
def Decide(U, D):
    if there exists any unit V in D such that DecideVia(U, V, D) == Some(b):
        // In this case it is guaranteed that all such b's will be the same
        return Some(b)
    else:
        return None
To explain DecideVia we first need to define "virtual voting" (we refer to the paper for intuitions and clarifications) -- intuitively, the units that are "beyond" U (i.e., have a higher round number) "vote" on whether we should decide true or false for U. It is good to think that this vote decides whether "U is well known among the nodes". We again emphasize that the decision for U depends only on the static structure of the Dag D, and apart from the impact U.creator had on the shape of D, it cannot affect the decision in any other way.
def Vote(U, V, D):
    // It is assumed that V.round > U.round
    round_diff = V.round - U.round
    if round_diff == 1:
        return [U belongs to parents(V)]
    else:
        let votes be the multiset of Vote(U, P, D) for P in parents(V)
        if votes are all the same, and equal to b:
            return b
        else:
            return CommonVote(round_diff)

def CommonVote(round_diff):
    if round_diff <= 4:
        if round_diff == 3:
            return false
        else:
            return true
    else:
        return [(round_diff % 2) == 1]

def DecideVia(U, V, D):
    round_diff = V.round - U.round
    if round_diff < 3:
        return None
    threshold = N-f
    cv = CommonVote(round_diff)
    let votes be the multiset of Vote(U, P, D) for P in parents(V)
    if among votes there are >= threshold votes equal to cv:
        return Some(cv)
    return None
We are not going into the details here of why the above guarantees the kind of monotonicity properties defined above; for that we refer to the paper. Let us, however, offer a quick argument for why we should expect consistency among decisions. The idea here is that once the condition in DecideVia is satisfied, i.e., among the parent votes there are at least threshold of them that agree with the CommonVote cv, then one can easily prove that starting from the next round, all units are going to vote for cv. Once that happens, it is easy to see that each unit beyond that point will make a decision consistent with cv. For proper proofs we refer to the Appendix of the paper.
2.4 Randomness In AlephBFT.
One of the differences between the paper and the AlephBFT implementation is the use of randomness. The AlephBFT protocol in the paper uses randomness in two places:
- The CommonVote(round_diff) for round_diff >= 5 is not deterministic, but is a random bit obtained from a seed generated as part of the protocol execution, specifically for the unit U the CommonVote concerns (thus, in particular, the CommonVote function takes a unit U as an additional parameter).
- The ordering over units in Head(r, D) is also random, and the seed is generated specifically for the round r.
The corresponding seeds are generated with appropriate timing, so that the randomness cannot be predicted well in advance. In the AlephBFT implementation both the common votes and the permutation of units are deterministic. The consequence of this is that the version of the protocol implemented in AlephBFT does not have the theoretical property called Asynchronous Liveness. This property means that in a theoretical scenario where the whole network is under the control of a powerful adversary, who can schedule all the packets (even from honest nodes) according to its liking, the protocol should still make progress in producing the output stream. That being said, there are two important comments to be made here:
- Even without randomness AlephBFT is Asynchronously Safe and enjoys all the properties of state-of-the-art partially synchronous protocols such as HotStuff, Tendermint or Streamlet. Moreover, the asynchronous design of the protocol makes AlephBFT especially robust and resistant against practical network issues that classical partially synchronous protocols might have trouble with. Asynchronous liveness is an important theoretical property, and a lot of technical sophistication goes into the design of the protocol in order to achieve it; however, on the practical side there is still little evidence that performing such attacks against liveness in real-world scenarios is possible.
- Still, no matter how unlikely such attacks might be, we take them very seriously and plan to add randomness to AlephBFT in one of the future releases. We decided to go for a version without randomness first, as it gives an incredibly simple and at the same time secure and robust BFT consensus protocol. Adding randomness introduces some complexity into the protocol, so it makes sense to add it on top of a well-tested, working product. The API of the protocol will not change and we will make the use of randomness configurable.
2.5 Alerts -- Dealing with Fork Spam.
We note that the OrderData algorithm as described in the previous subsections is resistant to forks, meaning that as long as there are at most f forking nodes, the resulting consensus protocol is still safe.
That being said, there are practical issues when it comes to dealing with huge amounts of forks. These are extensively described in the paper in the section on a "Fork Bomb Attack" -- we encourage the reader to take a look. We describe here a mitigation to this issue that is implemented in AlephBFT. It is based on the idea of "Alerts", described in the paper, but significantly simplified and made viable for implementation.
The main idea here is that we cannot afford to store an arbitrary number of units that are forks -- most of them are useless, but without adding a specific mechanism to the protocol, there is no way to discard them. That's why we introduce Fork Alerts. A fork alert is a message ForkAlert(sender, forker, proof, units), where sender is the node index of the sender, forker is the node index of a forking node, proof is a pair of units created by forker that constitute a fork, and units is simply some list of forker's units. Such a message is broadcast by sender to all other nodes (a sketch of the message structure follows the list below):
- The broadcast is performed using a Reliable Broadcast primitive. Roughly speaking, this means that the node does not simply multicast this message to all the nodes, but runs a specialized protocol that guarantees that all the nodes receive this message (if the sender is honest) and receive the exact same version of this message (even if the sender is dishonest). We give more details on AlephBFT's implementation of reliable broadcast in the Reliable Broadcast section.
- The intuitive meaning of such an alert is: "I, sender, am aware that forker forked and I attach a proof of that. Moreover, before I realized this fact, I accepted the following list of units from forker into my local copy of the Dag. Please accept these units as well."
- Each sender can send at most one alert regarding a specific forker. So in the worst case, a sender would send roughly f alerts.
- Each node, right after it learns about a new node forking, sends its own alert right away, attaching the list of units by forker that it added to its Dag.
- The units list cannot contain any forks and must consist of valid units (with correct signatures etc.).
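For illustration, the alert message described above could be rendered in Rust roughly as follows; the field names and the generic Unit parameter are hypothetical and do not reflect the crate's actual definitions.

// A hypothetical rendering of ForkAlert(sender, forker, proof, units).
type NodeIndex = usize;

struct ForkAlert<Unit> {
    sender: NodeIndex,   // the node raising the alert
    forker: NodeIndex,   // the node accused of forking
    proof: (Unit, Unit), // two units by `forker` of the same round: the fork proof
    units: Vec<Unit>,    // forker's units the sender had already accepted; must contain no forks
}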
Having this mechanism in place, we define the (semi-formal) notion of legit units as the ones that are either from a non-forking node, or were attached to one or more alerts (this notion is not entirely formal because it depends on the timing and on the view of the particular node, but it still carries some useful intuition). We keep the invariant that:

Each honest node adds to its Dag only legit units.

This ensures that there is a reasonable upper bound on the number of units a single node must hold and thus guarantees that AlephBFT will not run out of memory.
3 API of AlephBFT
3.1 Required Trait Implementations.
3.1.1 DataProvider & FinalizationHandler.
The DataProvider trait is an abstraction for a component that provides data items. DataProvider is parametrized with a Data generic type representing the type of items we would like to order. Below we give examples of what these might be.
pub trait DataProvider {
    type Output: Data;

    async fn get_data(&mut self) -> Option<Self::Output>;
}
AlephBFT internally calls get_data() whenever a new unit is created and data needs to be placed inside. If no data is currently available, the method should return None immediately so as not to halt unit creation.
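As an illustration, a minimal provider backed by an in-memory queue might look roughly as follows. This is only a sketch: it mirrors the shape of get_data but does not spell out the crate's exact trait bounds, and Data is taken to be u64 purely for the example.

use std::collections::VecDeque;

struct QueueDataProvider {
    pending: VecDeque<u64>, // Data is u64 here purely for illustration
}

impl QueueDataProvider {
    // Mirrors DataProvider::get_data: hand out the next pending item if there
    // is one, otherwise return None immediately so unit creation is not blocked.
    async fn get_data(&mut self) -> Option<u64> {
        self.pending.pop_front()
    }
}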
The FinalizationHandler trait is an abstraction for a component that handles finalized items. As with DataProvider, it is parametrized with a Data generic type.
pub trait FinalizationHandler<Data> {
    fn data_finalized(&mut self, data: Data);
}
Calls to the data_finalized function reflect the order of the units that AlephBFT produced and that hold some data.
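Analogously, a sketch of a handler that simply records finalized items in the order the calls arrive (again with Data = u64 as a stand-in):

struct RecordingFinalizationHandler {
    finalized: Vec<u64>,
}

impl RecordingFinalizationHandler {
    // Mirrors FinalizationHandler::data_finalized: invoked once per finalized
    // data item, in the order determined by AlephBFT.
    fn data_finalized(&mut self, data: u64) {
        self.finalized.push(data);
    }
}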
3.1.2 Network.
The Network trait defines the functionality we expect the network layer to satisfy and is quite straightforward:
pub trait Network<H: Hasher, D: Data, S: Encode + Decode>: Send {
    fn send(&self, data: NetworkData<H, D, S>, recipient: Recipient);

    async fn next_event(&mut self) -> Option<NetworkData<H, D, S>>;
}
Here NetworkData is a type representing possible network messages of the AlephBFT protocol. For the purpose of implementing the Network trait, what matters most is that they implement the Encode and Decode traits, i.e., allow for serialization/deserialization, and thus can be treated as byte arrays if that is more convenient. The Recipient specifies who should receive the message: either everyone or a node with a specific index:
pub enum Recipient {
    Everyone,
    Node(NodeIndex),
}
Additionally, NetworkData implements an included_data method, which returns all the Data that might end up ordered as a result of this message being passed to AlephBFT. The implementation of Network should ensure that the user's system is ready to have that Data ordered. In the case of Data only representing the actual data being ordered (e.g., hashes of blocks of transactions), this means ensuring data availability before passing the messages on.
The send method has straightforward semantics: sending a message to a single node or to all the nodes. next_event is an asynchronous method for receiving messages from other nodes.
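The sketch below illustrates only the Recipient-based routing a send implementation might perform, with a hypothetical per-peer queue standing in for real sockets; the asynchronous next_event side and the actual NetworkData type parameters are omitted, and Recipient is redefined locally just to keep the example self-contained.

use std::collections::HashMap;

type NodeIndex = usize;

enum Recipient {
    Everyone,
    Node(NodeIndex),
}

struct SketchNetwork {
    // One outgoing queue per peer; a real implementation would push these
    // bytes onto sockets and feed incoming bytes into next_event.
    outgoing: HashMap<NodeIndex, Vec<Vec<u8>>>,
}

impl SketchNetwork {
    // Mirrors the semantics of Network::send: best effort, no delivery guarantees.
    fn send(&mut self, encoded: Vec<u8>, recipient: Recipient) {
        match recipient {
            Recipient::Node(i) => {
                if let Some(queue) = self.outgoing.get_mut(&i) {
                    queue.push(encoded);
                }
            }
            Recipient::Everyone => {
                for queue in self.outgoing.values_mut() {
                    queue.push(encoded.clone());
                }
            }
        }
    }
}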
Note on Rate Control: it is assumed that Network implements a rate control mechanism guaranteeing that no node is allowed to spam messages without limits. We do not specify details yet, but in future releases we plan to publish recommended upper bounds on the bandwidth and number of messages allowed per node per unit of time. These bounds must be carefully crafted based upon the number of nodes N and the configured delays between subsequent Dag rounds, so that spammers are cut off while honest nodes are still able to function correctly within these bounds.
Note on Network Reliability: it is not assumed that every message AlephBFT requests to be sent reaches its intended recipient; there are built-in reliability mechanisms within AlephBFT that automatically detect certain failures and resend messages as needed. Clearly, the less reliable the network is, the worse the performance of AlephBFT will be (it will generally be slower to produce output). Also, not surprisingly, if the percentage of dropped messages is too high, AlephBFT might stop making progress, but from what we observe in tests, this happens only when the reliability is extremely bad, i.e., drops below 50% (which means there is some significant issue with the network).
3.1.3 Keychain.
The Keychain trait is an abstraction for digitally signing arbitrary data and verifying signatures created by other nodes.
pub trait Keychain: Index + Clone + Send + Sync + 'static {
    type Signature: Signature;

    fn sign(&self, msg: &[u8]) -> Self::Signature;

    fn verify(&self, msg: &[u8], sgn: &Self::Signature, index: NodeIndex) -> bool;
}
A typical implementation of Keychain would be a collection of N public keys, an index i, and a single private key corresponding to public key number i. The meaning of sign is then to produce a signature using the given private key, and verify(msg, s, j) is to check whether the signature s under the message msg is correct with respect to the public key of the j-th node.
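The following deliberately insecure toy sketch only illustrates the shape of sign and verify relative to a committee of node_count nodes: the "signature" is just the signer's index, so a real implementation must replace it with actual public-key cryptography.

type NodeIndex = usize;

#[derive(Clone)]
struct ToyKeychain {
    node_count: usize,
    index: NodeIndex,
}

impl ToyKeychain {
    // A real sign() would use the node's private key; here the "signature" is
    // just the signer's index, which is of course NOT secure.
    fn sign(&self, _msg: &[u8]) -> Vec<u8> {
        self.index.to_le_bytes().to_vec()
    }

    // A real verify() would check the signature against node `index`'s public key.
    fn verify(&self, _msg: &[u8], sgn: &[u8], index: NodeIndex) -> bool {
        index < self.node_count && sgn == index.to_le_bytes().as_slice()
    }
}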
3.1.4 Read & Write -- recovering from mid-session crashes
The std::io::Write and std::io::Read traits are used for creating backups of the units created during a session. This is part of crash recovery: the units a member has created are needed for it to recover after a crash during a session, in order for Aleph to remain BFT. This means that the user needs to provide implementations of std::io::Write and std::io::Read that are used for storing and reading back the units created by this member. At first (before any crash) std::io::Read should return nothing; after a crash it should contain all the data that was stored in this session so far.
These traits are optional. If you do not want to recover from mid-session crashes, or your session handling ensures AlephBFT will not run in the same session twice, you can pass a NOOP implementation here.
std::io::Write should provide a way of writing the data generated during the session that should be backed up. The flush method should block until the written data is backed up.
std::io::Read should provide a way of retrieving backups of all data generated during the session by this member, in case of a crash. std::io::Read should have a copy of all the data, so that writing to std::io::Write has no effect on reading.
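A minimal file-based sketch of such backup handles is shown below; the file name is hypothetical, and an application that needs strict crash safety should also make sure that flush actually syncs the data to disk and that the loader is fully consumed on startup before new writes happen.

use std::fs::{File, OpenOptions};
use std::io;

// `session_id` is a hypothetical identifier chosen by the caller; one backup
// file per session. Error handling and fsync policy are left to the application.
fn backup_handles(session_id: u32) -> io::Result<(File, File)> {
    let path = format!("alephbft_backup_{session_id}.bin");
    // The writer appends everything produced during the session...
    let saver = OpenOptions::new().create(true).append(true).open(&path)?;
    // ...while the reader starts at the beginning and, after a crash, returns
    // everything that was saved before.
    let loader = File::open(&path)?;
    Ok((saver, loader))
}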
3.2 Examples
While the implementations of Keychain, std::io::Write, std::io::Read and Network are pretty much universal, the implementations of DataProvider and FinalizationHandler depend on the specific application. We consider two examples here.
3.2.1 Blockchain Finality Gadget.
Consider a blockchain that does not have an intrinsic finality mechanism, so for instance it might be a PoW chain with probabilistic finality or a PoS chain that uses some probabilistic or round-robin block proposal mechanism. Each of the N nodes in the network has its own view of the chain, and at certain moments in time there might be conflicts regarding what the given nodes consider to be the "tip" of the blockchain (because of possible forks or network delays). We would like to design a finality mechanism based on AlephBFT that will allow the nodes to reach agreement on what the "tip" is. Towards this end, we just need to implement a suitable DataProvider object and filtering of network messages for AlephBFT.
For concreteness, suppose each node holds the genesis block B[0] and its hash hash(B[0]) and treats it as the highest finalized block so far.
We start by defining the Data type: this will simply be Hash, representing a block hash in our blockchain. Subsequently, we implement get_data (note that we use pseudocode instead of Rust, for clarity):
def get_data():
    let B be the tip of the longest chain extending from the most recently finalized block
    return Some(hash(B))
This is simply the hash of the block the node thinks is the current "tip".
Using the included_data method of NetworkData we can filter incoming network messages in our implementation of Network. The code handling incoming network messages could be:
def handle_incoming_message(M):
    let hashes = M.included_data()
    if we have all the blocks referred to by hashes:
        if we have all the ancestors of these blocks:
            add M to ready_messages
            return
    add M with its required hashes to waiting_messages
    request all the blocks we don't have from random nodes
We note that availability of a block alone is not quite enough in this case, as, for instance, if the parent of a block is missing from our storage, we cannot really think of the block as available, because we cannot finalize it.
When we receive a new block, we can check all the messages in waiting_messages and move the ones that now have all their hashes satisfied into ready_messages.
def handle_incoming_block(B):
    block_hash = hash(B)
    for M in waiting_messages:
        if M depends on block_hash:
            remove the dependency
            if we don't have an ancestor B' of B:
                add a dependency on hash(B') and request it from random nodes
            if M has no more dependencies:
                move M to ready_messages
The next_event method of Network simply returns messages from ready_messages.
Now we can implement handling of block finalization by implementing the FinalizationHandler trait:
def data_finalized(block_hash):
    let finalized = the highest finalized block so far
    // We have this block in local storage thanks to the filtering above.
    let B be the block such that hash(B) == block_hash
    if finalized is an ancestor of B:
        finalize all the blocks on the path from finalized to B
Since (because of AlephBFT's guarantees) all the nodes locally observe hashes in the same order, the above implies that all the nodes will finalize the same blocks, possibly with slight differences in timing.
3.2.2 State Machine Replication (Standalone Blockchain).
Suppose a set of N nodes would like to implement State Machine Replication -- roughly speaking, a blockchain. Each of the nodes keeps a local transaction pool: transactions it received from users. The goal is to keep producing blocks with transactions, or in other words, to produce a linear ordering of these transactions. As previously, we demonstrate how one should go about implementing the DataProvider and FinalizationHandler objects for this application.
First of all, Data in this case is Vec<Transaction>, i.e., a list of transactions.
def get_data():
    tx_list = []
    while tx_pool.not_empty() and tx_list.len() < 100:
        tx = tx_pool.pop()
        tx_list.append(tx)
    return Some(tx_list)
We simply fetch at most 100 transactions from the local pool and return such a list of transactions.
def data_finalized(tx_list):
    let k be the number of the previous block
    remove duplicates from tx_list
    remove from tx_list all transactions that appeared in blocks B[0], B[1], ..., B[k]
    form block B[k+1] using tx_list as its content
    add B[k+1] to the blockchain
Whenever a new batch is received we simply create a new block out of the batch's content.
When it comes to availability, in this case Data is not a cryptographic fingerprint of the data but the data itself, so no filtering of network messages is necessary. However, the messages can still be inspected to precompute some operations as an optimization.
3.3 Guarantees of AlephBFT.
Let round_delay be the average delay between two consecutive rounds in the Dag, which can be configured in AlephBFT (default value: 0.5 sec). Under the assumption that there are at most f = floor((N-1)/3) dishonest nodes in the committee and the network behaves reasonably well (we do not specify the details here, but roughly speaking, a weak form of partial synchrony is required), AlephBFT guarantees that:
- Each honest node will make progress in producing its out stream at a pace of roughly 1 ordered batch per round_delay seconds (by default, two batches per second).
- For honest nodes that are not "falling behind" significantly (because of network delays or other issues), it is guaranteed that the data items they input into the protocol (from their local DataProvider object) will have FinalizationHandler::data_finalized called on them with a delay of roughly ~round_delay*4 from the time of inputting them. It is hard to define what "falling behind" exactly means, but think of a situation where a node's round-r unit always arrives much later than the expected time for round r to start. When a node is falling behind only from time to time, there is no issue and its data will still be included in the output stream; however, if this problem is chronic, then this node's data might not find its way into the output stream at all. If something like that happens, it most likely means that the round_delay is configured too aggressively and one should consider extending the delay.
We note that the issue of an honest node's data not being included in the stream is not too dangerous for most applications. For instance, for the two example scenarios:
- For the finality gadget example, most honest nodes see the same blocks and there is a high level of redundancy, so it does not matter much that some of the nodes are possibly censored.
- For the state machine replication example, one must assume that there is some redundancy in the way transactions are distributed among nodes. For instance, if there is a guarantee (which one can easily achieve by randomly gossiping each transaction to a small random subset of nodes) that each transaction appears in the pools of at least 5 nodes, then the issue of censorship essentially goes away.
Still, the most important thing to take away from this section is that if censorship happens, then the configuration of AlephBFT is suboptimal and one should increase the round_delay.
3.3.1 What happens when not enough nodes are honest.
If fewer than floor(2N/3)+1 nodes are online and honestly following the protocol rules, then certain failures might happen. There are two possibilities:
- Stall -- the output streams of nodes stop producing data items. This is also what will happen when the nodes are generally honest, but there is either a significant network partition or lots of nodes crash. If this is caused by network issues rather than malicious behavior, the protocol will recover by itself and eventually resume its normal execution.
- Inconsistent Output -- this is the most extreme failure that can happen and can only be a result of malicious behavior by a significant fraction of all the nodes. It means that the honest nodes' output streams stop being consistent. In practice, for this to happen the adversary must control lots of nodes, i.e., around (2/3)N. The type of failure that would usually happen if the adversary controls barely above floor(N/3)+1 nodes is a stall.
3.4 AlephBFT Sessions.
Currently the API of AlephBFT allows running a single Session that is expected to last a fixed number of rounds and thus to finalize a fixed number of output batches. By default an AlephBFT Session is 5000 rounds long, but the protocol proceeds at regular speed (i.e., 500ms per round) only for the first 3000 of these rounds, and after that it starts to slow down (each round is 1.005 times slower than the previous one), so that round 5000 is virtually impossible to reach.
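The delay schedule described above corresponds roughly to the following sketch; the exact configuration fields and formula used in the crate may differ.

use std::time::Duration;

// Rounds 0..2999: constant 500 ms; afterwards each round is 1.005 times slower
// than the previous one, so round 5000 is effectively unreachable.
fn round_delay(round: u32) -> Duration {
    let base_ms = 500.0_f64;
    if round < 3000 {
        Duration::from_millis(500)
    } else {
        Duration::from_secs_f64(base_ms * 1.005_f64.powi((round - 3000) as i32) / 1000.0)
    }
}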
There are essentially two ways to use AlephBFT:
- Single Session -- just run a single session to reach consensus on some specific one-time question. In this case one can run the default configuration and just terminate the protocol once the answer is in the output stream.
- Multiple Sessions -- a mode of execution in which AlephBFT is run several times sequentially. An important motivating example is the use of AlephBFT as a finality gadget for a blockchain. Think of session k as being responsible for finalizing blocks at heights [100k, 100(k+1)-1]. There should then be an external mechanism to start a new AlephBFT session when the last block of a session gets finalized (and to stop inactive sessions as well). This example gives a clear answer as to why we opted for the slowdown after round 3000 explained above: it is to make sure that, no matter the variance in block-time of the block production mechanism, and no matter whether there are stalls, network issues, crashes, etc., the prespecified segment of blocks is guaranteed to be finalized in a given session. Readers who are experienced with consensus engines are surely aware of how problematic it would be if, at the end of a session, say, only 70 out of the intended 100 blocks were finalized. That's why it's better to slow down consensus but make sure it achieves the intended goal.
Why are there even sessions in AlephBFT? To answer this question one would need to make a deep dive into the internal workings of AlephBFT, but a high level summary is: we want to make AlephBFT blazing fast, hence we need to keep everything in RAM (no disk), hence we need to have a round limit, hence we need sessions. For every "hence" in the previous sentence there are extensive arguments to back it, but they are perhaps beyond the scope of this document. We are aware of the inconvenience that it brings -- being forced to implement a session manager, but:
- We feel that, depending on the application, there might be different ways to deal with sessions, and it's better if we leave the task of session management to the user.
- In one of the future releases we plan to add an optional default session manager, but will still encourage the user to implement a custom one for a particular use-case.
4 Differences between the implementation and the paper.
There are several differences between the Aleph as described in the paper and the version implemented in AlephBFT. Many of them are already described in previous sections but for completeness we briefly list the differences here.
- The main version of Aleph uses Reliable Broadcast to disseminate units. The AlephBFT implementation is closer to QuickAleph (in the Appendix of the paper) that uses Reliable Broadcast only for Alerts.
- The specifics of alerts are different in the AlephBFT implementation -- in particular they do not require freezing the protocol at any moment and are generally simpler.
- AlephBFT uses its own variant of Reliable Broadcast -- see the section Reliable Broadcast.
- Differences in the use of randomness -- see Randomness in AlephBFT.
- The parent choice rule for units in the paper is different: it allows parents from older rounds as well, not only the previous round. The reason why the paper allows older rounds is to guarantee a theoretical censorship resistance property in the asynchronous setting. In practice this is irrelevant and getting rid of that makes the protocol simpler and more efficient.
- The main version in the paper uses a full list of parent hashes instead of control hashes -- the latter is described in the Appendix as an optimization.
- The paper's Appendix proposes the use of random gossip as a method of disseminating units -- AlephBFT uses repeated broadcast + a request/response mechanism instead, which according to our experience performs much better in practice.
5 AlephBFT Internals
To explain the inner workings of AlephBFT it is instructive to follow the path of a unit: from the very start when it is created, to the moment when its round is decided and its data is placed in one of the output batches. Here we give a brief overview and subsequently go into more detail on specific components in dedicated subsections.
- The unit is created by the node's Creator component -- implemented in creation/. The Creator sends the produced unit to runway/, which then sends it to member.rs.
- A recurring task of broadcasting this unit is put in the task queue. The unit will be broadcast to all other nodes a few times (with some delays in between).
- The unit is received by another node -- this happens in member.rs -- and is immediately sent to runway/ for further processing in dag/.
- The Dag component validates and reconstructs a unit's parents in several steps:
  - Validation, implemented in dag/validation.rs, checks signatures and basic unit properties, and also catches forks. This means that only legit units, in the sense defined in the section on alerts, are sent further. Thus no fork is ever passed on unless it comes from an alert.
  - The units are then moved to a component responsible for reconstructing the explicit parents of these units -- implemented in dag/reconstruction/parents.rs.
  - Each unit whose parents are successfully decoded is passed on to dag/reconstruction/dag.rs, which ensures that units are passed on only after their parents have been. They are then returned back to runway/.
- In runway/ such units are put in a store. Each unit in the store is legit and has all its parents in the store.
- Such units are passed to a component called the Extender -- see the files in extension/. The role of the Extender is to efficiently run the OrderData algorithm described in the section on AlephBFT.
- Once a unit's data is placed in one of the batches by the Extender, its path is over, although we keep it in the runway store to be able to send it to other nodes on request.
5.1 Creator
The creator produces units according to the AlephBFT protocol rules. It will wait until the prespecified delay has passed and attempt to create a unit using a maximal number of parents. If it is not possible yet, it will wait till the first moment enough parents are available. After creating the last unit, the creator stops producing new ones, although this is never expected to happen during correct execution.
5.2 Dag
The dag receives units from the network and returns any that were successfully reconstructed with parents. It does that in several steps, starting with validation.
5.2.1 Validation
The validation process consists of checking basic properties of units (correct number of parents, correct session, etc.), their signatures, and whether the unit is a fork, based on the units that the node either already has or has at least started processing. As mentioned, the idea is that only legit units are passed on to the reconstructing component. In case a fork by a node i is detected, all of i's units are attached to the appropriate alert, so that other nodes can accept them as legit.
The next step is to reconstruct the structure of the Dag from the somewhat compressed information in the units.
5.2.2 Parents
The reconstruction service receives legit units, but the information about their parents is only present as a control hash, i.e., which nodes created the parents and what was the combined hash of all the parents' hashes. Parent reconstruction remembers the first unit for any creator-round combination it encounters and optimistically uses this information to check the combined hash. If there are no dishonest nodes, which is the usual situation, this means that every unit might at most have some parents that cannot yet be checked, because the node has not yet received them. In such a case, requests for these missing units are sent to Member. After the units are received, the control hash check succeeds and thus the parents are reconstructed successfully.
If dishonest nodes participate in the protocol, then two additional things can go wrong:
- either the unit has one or multiple parents that are forks (to be precise, with variants different from the first ones received by this node). The reconstructing service might or might not have access to the correct variants, but in either case it does not attempt to perform the naive check on different variants -- guessing the correct variants might require exponential time, so there is no point in even trying,
- or the unit's creator is dishonest and just put a control hash in the unit that does not "unhash" to anything meaningful.
In either case the reconstruction triggers a request to Member to download the full list of the unit's parent hashes, so that the ambiguity is resolved. Once a response is received by Member, it is passed back to the reconstruction so that it can "decode" the parents and proceed.
5.2.3 Dag
A unit's parents might, for many reasons, not be reconstructed in an order agreeing with the Dag order, i.e., some of their ancestors might not yet be reconstructed. The Dag component ensures that units are only added to the store after their parents have already been added, and thus any units emitted by the Dag component are in an order agreeing with the Dag order.
5.3 Extender
The Extender's role is to receive Dag units (in an order agreeing with the Dag order) and extend the output stream. Towards this end, it elects the Head for each round. Such an election works by going through candidate units from this round, either eliminating them or eventually electing one. Votes are computed and cached for each candidate until a decision on it is made, after which the election moves on to the next round (if the candidate was elected as Head) or to the next unit (otherwise). After electing each Head, the Extender deterministically orders all of its unordered ancestors and the Head itself and returns the resulting batch.
6 Reliable Broadcast
Recall that Reliable Broadcast is the primitive we use to broadcast fork alerts among nodes -- see the section on alerts. There are two requirements for a protocol realizing reliable broadcast:
- If an honest sender initiates a reliable broadcast with some message m, then the protocol terminates and, as a result, all honest nodes receive m.
- If a malicious sender initiates a reliable broadcast, then either it terminates for all honest nodes and they receive the same message m, or it does not terminate for any honest node.
So, roughly speaking, we want a broadcast primitive that is consistent even if a malicious node is the sender. There is the possibility that a malicious broadcast will not terminate, but it is not hard to see that this is the best we can hope for.
6.1 Consistent Broadcast using multisignatures -- RMC
The main idea behind the reliable broadcast implementation in AlephBFT is the use of multisignatures. Without going too much into details, think of a multisignature over a message m as a list of N-f signatures by N-f different committee nodes over the same message m (more efficient ways to achieve such functionality are possible, like threshold signatures or signature aggregates, but they are beyond the scope of this section). Someone holding a multisignature over m can be sure that a large fraction of the nodes "agree" (with the meaning of "agree" depending on the particular application) on m.
The RMC (Reliable MultiCast) protocol is a way to reliably disseminate a single hash h among all nodes (in the next section we explain how to extend it to disseminating arbitrary data, and not only a hash). The idea is as follows:
- Whenever a node i wishes to disseminate a hash h, it initiates a reliable multicast instance by signing h and sending the signed hash SIG(h, i, sig_i) to all other nodes.
- Upon receiving such a signed hash, each node j signs h and sends its signed hash SIG(h, j, sig_j) to all other nodes.
- Each node keeps receiving signatures under h from different nodes. Upon receiving N-f of them, the node combines the signatures into a single multisignature msig and sends the message MULTISIG(h, msig) to all nodes.
- Upon receiving MULTISIG(h, msig) under h, each node passes it on to all other nodes.
The moment a node receives MULTISIG(h, msig) is considered the completion of the multicast for this node, and (even though the node still keeps resubmitting messages) this instance of RMC is considered successful. If an RMC succeeds for some honest node, then it is guaranteed to succeed for all honest nodes (though maybe with some delay). We refer to the file /src/rmc.rs for thorough documentation of this component.
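A simplified sketch of the signature-collecting part of RMC might look as follows; the types are toy stand-ins, signature verification and networking are omitted, and this is not the structure of the actual rmc.rs code.

use std::collections::HashMap;

type Hash = [u8; 32];
type NodeIndex = usize;
type Signature = Vec<u8>;

// The two message kinds of RMC as described above.
enum RmcMessage {
    SignedHash(Hash, NodeIndex, Signature),         // SIG(h, i, sig_i)
    MultiSigned(Hash, Vec<(NodeIndex, Signature)>), // MULTISIG(h, msig)
}

struct RmcState {
    threshold: usize, // N - f
    collected: HashMap<Hash, Vec<(NodeIndex, Signature)>>,
}

impl RmcState {
    // Record an incoming signature over `h` (assumed already verified); once
    // `threshold` distinct signers are collected, return the MULTISIG message
    // that should be broadcast to everyone.
    fn on_signed_hash(&mut self, h: Hash, i: NodeIndex, sig: Signature) -> Option<RmcMessage> {
        let sigs = self.collected.entry(h).or_default();
        if !sigs.iter().any(|(j, _)| *j == i) {
            sigs.push((i, sig));
        }
        if sigs.len() >= self.threshold {
            Some(RmcMessage::MultiSigned(h, sigs.clone()))
        } else {
            None
        }
    }
}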
6.2 Reliable Broadcast based on RMC
Having the idea of RMC, one can modify it quite easily to achieve reliable broadcast. A naive way to do so would be to let the sender node hash the message m it intends to reliably broadcast into h = hash(m) and use RMC on the hash h. This almost works, except for the data availability problem -- a malicious sender might simply send a random, meaningless hash h, and then the honest nodes would never be able to recover the underlying data.
To circumvent the data availability problem, we instruct the sender to send the data m to all the nodes and only then to initiate RMC on h = hash(m). If we make sure that no honest node proceeds with the RMC before it receives the data m, then a successful RMC guarantees that most of the honest nodes hold the data m. This is the basic idea behind the protocol implemented for fork alerts in AlephBFT; we refer to /src/alerts.rs for details.