An Open Standard For Distributed Inference In Federated Networks
EdgeLlama is an open standard that allows any supported machine to serve as an inference provider. It is a stateful, application-layer communication protocol that exposes HTTP and WebSocket APIs for securely transmitting JSON-formatted inference requests across two primary layers: the CommunityServer-Edge Communication Layer and the CommunityServer-CommunityServer Communication Layer.
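To make the wire format concrete, here is a minimal sketch of what a JSON inference request might look like. The field names and values are illustrative assumptions, not taken from the standard itself.

```python
import json

# Hypothetical shape of an EdgeLlama inference request; every field name
# here is an assumption for illustration, not prescribed by the standard.
request = {
    "type": "inference_request",
    "prompt": "Summarise the following paragraph: ...",
    "model": {
        "id": "meta-llama/Llama-2-7b-hf",  # HuggingFace ID (example)
        "revision": "main",
        "quantization": "q4_0",
    },
}

# The payload as it would travel over the HTTP or WebSocket transport.
wire_payload = json.dumps(request)
```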
The protocol assumes an asynchronous distributed system in which nodes are interconnected through a network. This network may be unreliable: it can fail to deliver messages, or delay, duplicate, and reorder them.
To protect the integrity of transmitted messages, the protocol employs cryptographic measures: public-key signatures, message authentication codes (MACs), and digests produced by collision-resistant hash functions. It is essential to verify that a received message genuinely originates from the claimed node and has not been tampered with or forged by a malicious party.
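The MAC and digest primitives mentioned above can be sketched with Python's standard library. The shared key, message, and choice of SHA-256 are assumptions for illustration; the standard does not prescribe these exact algorithms here.

```python
import hashlib
import hmac

shared_key = b"node-shared-secret"           # assumed pre-shared MAC key
message = b'{"type": "inference_request"}'   # example message body

# Collision-resistant digest of the message.
digest = hashlib.sha256(message).hexdigest()

# MAC tag the sender attaches so the receiver can authenticate the message.
tag = hmac.new(shared_key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, msg: bytes, received_tag: str) -> bool:
    # Receiver recomputes the MAC and compares in constant time,
    # rejecting messages that were tampered with or forged.
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_tag)
```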
A Community Server can implement this verification in one of two ways: stateful or stateless.
Upon successfully connecting to a Community Server, an EdgeLlama node announces its capabilities: the models it can run (identified by HuggingFace ID), the model revision, quantization, CUDA support, and other relevant characteristics. Based on this information, the Community Server categorizes the EdgeLlama node and subscribes it to the PubSub topic corresponding to the model it runs.
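A minimal sketch of this capability announcement and topic subscription, assuming an in-memory registry and a topic naming scheme of our own invention (the standard does not specify either):

```python
# Illustrative capability announcement; field names are assumptions.
capabilities = {
    "model_id": "meta-llama/Llama-2-7b-hf",  # HuggingFace ID
    "revision": "main",
    "quantization": "q4_0",
    "cuda": True,
}

def topic_for(caps: dict) -> str:
    # Assumed scheme: one PubSub topic per (model, revision, quantization).
    return f"{caps['model_id']}@{caps['revision']}#{caps['quantization']}"

# The Community Server's internal registry: topic -> subscribed node IDs.
subscriptions: dict[str, set[str]] = {}

def subscribe(node_id: str, caps: dict) -> None:
    subscriptions.setdefault(topic_for(caps), set()).add(node_id)

subscribe("node-1", capabilities)
```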
The fortification check is triggered once 'n' successful inferences have been executed for a given model; thereafter, the Community Server works to maintain the quality of the nodes subscribed to that model's topic. The Community Server challenges the EdgeLlama node with alpha Inference Calls chosen at random from the preceding 'n' successful inferences, for which it has retained digests of the outputs. Following the protocol standard, the EdgeLlama node, being subscribed to the model's topic, executes the prescribed inference. The digest of this fresh inference is then compared against the digests stored on the Community Server. If they match, the EdgeLlama node is deemed to have passed the fortification check.
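The fortification check described above can be sketched as follows. The data structures, the value of alpha, and the use of SHA-256 are all illustrative assumptions.

```python
import hashlib
import random

# Digests of the last n successful inferences, retained by the Community
# Server as (inference call ID, digest of its output). Illustrative data.
stored = [
    ("call-1", hashlib.sha256(b"output-1").hexdigest()),
    ("call-2", hashlib.sha256(b"output-2").hexdigest()),
    ("call-3", hashlib.sha256(b"output-3").hexdigest()),
]

def fortification_check(node_outputs: dict, alpha: int = 2) -> bool:
    # Challenge the node with alpha calls sampled at random from the
    # last n successful inferences.
    for call_id, expected_digest in random.sample(stored, alpha):
        replayed = hashlib.sha256(node_outputs[call_id]).hexdigest()
        if replayed != expected_digest:
            return False  # node's output diverges from the recorded digest
    return True  # all sampled digests match: the node passes the check
```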
Facilitating an inference is not a direct function call but an orchestrated exchange between the Community Server and the various EdgeLlama nodes.
An inbound inference request received by the Community Server consists of two main components:
• Prompt: A string that delineates the information or data upon which the inference is to be executed.
• Model: A ModelConfig object that encapsulates the details of the model to be used for the inference, such as the model's identifier and revision.
Upon receipt of the inference request, the Community Server's immediate task is to identify the set of EdgeLlama nodes that are capable of fulfilling the request. This is achieved by:
• Referencing the ModelConfig object within the request to identify the desired model.
• Consulting its internal registry to retrieve a list of EdgeLlama nodes that are subscribed to the identified model topic.
With the list of appropriate EdgeLlama nodes at its disposal, the Community Server begins broadcasting the inference request. First, it generates a seed and a UUID for that specific inference request, appends both to the request, and disseminates the request in parallel to a fanout of the identified EdgeLlama nodes. This gives multiple nodes the opportunity to process the request, providing redundancy and increasing the chance of a prompt response. Each EdgeLlama node, upon receiving the broadcast request, initiates the steps needed to run the inference based on the provided prompt and its local instance of the model specified in the ModelConfig.
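The broadcast-preparation step can be sketched as below, under the assumption that the seed is a random 64-bit integer, the ID is a version-4 UUID, and the fanout targets are sampled from the subscribed nodes; none of these details are fixed by the text above.

```python
import random
import uuid

def prepare_broadcast(request: dict, nodes: list[str], fanout: int):
    # Attach a seed and a UUID to the inference request, then pick a
    # fanout of subscribed nodes to receive it in parallel.
    request = dict(request)                     # don't mutate the caller's copy
    request["seed"] = random.getrandbits(64)    # shared seed for determinism
    request["id"] = str(uuid.uuid4())           # unique ID for this inference
    targets = random.sample(nodes, min(fanout, len(nodes)))
    return request, targets

req, targets = prepare_broadcast(
    {"prompt": "...", "model": {"id": "meta-llama/Llama-2-7b-hf"}},
    nodes=["node-1", "node-4", "node-9"],
    fanout=2,
)
```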
Because model execution is deterministic in this environment, identical seed and model parameters produce an identical outcome. The Community Server leverages this property to verify the precision and uniformity of inferences.
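A toy illustration of this determinism assumption, with a seeded pseudo-random generator standing in for real model execution (which the protocol assumes behaves the same way):

```python
import random

def run_inference(prompt: str, seed: int) -> str:
    # Stand-in for deterministic model execution: with the same seed and
    # parameters, every node produces byte-identical output.
    rng = random.Random(seed)
    return f"{prompt}:{rng.getrandbits(32)}"

# Two independent "nodes" running the same request with the same seed.
out_a = run_inference("hello", seed=42)
out_b = run_inference("hello", seed=42)
```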
Inference aggregation is a critical stage in the Community Server's inference process, especially in contexts where inference tasks are distributed across numerous EdgeLlama nodes.
By comparing responses from multiple nodes, the Community Server can isolate the most common response, improving the accuracy of the final outcome. When some EdgeLlama nodes suffer latency, discrepancies, or faults in their output, distributing the inference task across multiple nodes gives the Community Server redundancy and alternative response paths. The aggregation method is a simple majority vote paired with feedback loops.
Treating each Inference Resolution digest as a vote, the Community Server Protocol computes a trustworthiness score for each unique response digest; under normal conditions, a single distinct digest is expected from the EdgeLlama nodes. The score is derived using weighted voting: responses from EdgeLlama nodes that have been historically reliable, or that meet specific criteria, carry additional weight. The preferred response is the one that accumulates the maximum aggregated weight.
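The weighted-vote aggregation can be sketched as follows. The per-node weights, the default weight of 1.0, and the digest values are illustrative assumptions.

```python
from collections import defaultdict

# Assumed reliability weights per node; how these are derived is not
# specified here, so the values are placeholders.
node_weight = {"node-1": 1.0, "node-4": 1.5, "node-9": 0.5}

def aggregate(votes: dict[str, str]) -> str:
    # votes maps node ID -> that node's response digest.
    totals: dict[str, float] = defaultdict(float)
    for node, digest in votes.items():
        totals[digest] += node_weight.get(node, 1.0)
    # The preferred response is the digest with the maximum aggregated weight.
    return max(totals, key=totals.get)
```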