- cross-posted to:
- lemmy@lemmy.ml
https://github.com/LemmyNet/lemmy/issues/3245
I posted far more details on the issue than I am putting here.
But, just to bring some math in- with the current full-mesh federation model, assuming 10,000 instances-
That will require nearly 50 million connections.
Each comment. Each vote. Each post will have to be sent 50 million separate times.
In the proposed hub-spoke model, we can reduce that by over 99%, so that each post/vote/comment/etc. only has to be sent 10,000 times (plus n*(n-1)/2 times, where n = number of hub servers).
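For reference, the arithmetic behind those figures can be checked with a short sketch. The function names are made up for illustration, and the hub count of 10 is an assumed value, not something taken from the issue.

```python
def full_mesh_links(n: int) -> int:
    """Number of distinct pairs among n instances: n*(n-1)/2."""
    return n * (n - 1) // 2


def hub_spoke_links(instances: int, hubs: int) -> int:
    """One link per instance to a hub, plus a full mesh between the hubs."""
    return instances + full_mesh_links(hubs)


instances = 10_000
hubs = 10  # assumed number of hub servers, purely illustrative

mesh = full_mesh_links(instances)
hub_spoke = hub_spoke_links(instances, hubs)
reduction = 1 - hub_spoke / mesh

print(mesh)                # 49995000
print(hub_spoke)           # 10045
print(f"{reduction:.2%}")  # 99.98%
```

Note that this only counts links between servers. It says nothing about how many times a single activity is actually sent.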
The current full-mesh architecture will not scale. I predict exponential growth will continue.
Let’s work on a solution to this problem together.
You got a lot of heat in this discussion, but let me be one of the few to applaud you for actually making a proposal. Saying No is easy, but suggesting something and writing it down and putting it out there is hard.
I am a Principal Engineer by trade, and I do what you did here all the time. I put out suggestions to my team and let them absolutely wreck it. This is how you advance and enhance your idea. Listen and learn from the feedback and suggest another thing based on what you have learned. Rinse and repeat.
That’s how you get to a great proposal. Keep at it. Well done.
I put out suggestions to my team and let them absolutely wreck it.
I know the feeling, I am used to it. My day job is a combination of consultant and project manager (with some software dev every now and then). I get to sit down and help design and architect things, and solve problems. There generally isn’t a solution everyone likes or agrees with, but if you can check off more issues than you cause, it’s generally a step in the right direction.
People are absolutely stomping on the idea for the most part, but I do think a few good points have come out of the discussion.
And a few good points are better than no points at all!
And at the very least, there’s a record of the discussions and thought processes behind why this was or wasn’t chosen.
But, just to bring some math in- with the current full-mesh federation model, assuming 10,000 instances-
That will require nearly 50 million connections.
Each comment. Each vote. Each post will have to be sent 50 million separate times.
Well your whole premise is just utterly wrong.
The way federation actually works:
A user on lemmy.ml subscribes to a community on lemmy.world. Say, !funny@lemmy.world
Assume that this user is the first lemmy.ml user to do so - basically what happens is the lemmy.world community sees that a member of a never before seen instance just subscribed. !funny@lemmy.world then adds lemmy.ml to its list of instances it needs to tell whenever something happens in the community.
No matter how many users of lemmy.ml subscribe, this only happens once.
Now when a user of sh.itjust.works upvotes a post on !funny@lemmy.world, the sh.itjust.works instance then tells !funny@lemmy.world of this change. It accepts the change, then tells everyone on its list of instances that have subscribers on them.
So essentially, sh.itjust.works talks to lemmy.world, lemmy.world tells everyone else. There is no “full mesh”. The instance hosting the community is the “hub”, everything else is a spoke.
So if there’s 10,000 instances, and they all just so happen to have at least one subscriber to some community, each change will be sent out 9,999 times. Your “50 million” premise is just completely wrong and I’m not sure where it’s coming from.
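A minimal sketch of the flow described above, assuming the host simply keeps a set of subscribed instances and notifies every one of them. The `Community` class and its methods are hypothetical and are not Lemmy's actual code.

```python
from __future__ import annotations


class Community:
    """Hypothetical model of a community and the instances it notifies."""

    def __init__(self, name: str, host: str) -> None:
        self.name = name
        self.host = host
        self.subscribed_instances: set[str] = set()

    def subscribe(self, user: str, instance: str) -> None:
        # Recorded once per remote instance, however many of its users subscribe.
        if instance != self.host:
            self.subscribed_instances.add(instance)

    def deliveries(self, origin: str) -> list[tuple[str, str]]:
        """List the (sender, receiver) calls needed for one activity."""
        calls = []
        if origin != self.host:
            # The originating instance tells the host first.
            calls.append((origin, self.host))
        for instance in sorted(self.subscribed_instances):
            # The host then tells every instance on its list.
            calls.append((self.host, instance))
        return calls


funny = Community("funny", "lemmy.world")
funny.subscribe("alice", "lemmy.ml")
funny.subscribe("bob", "lemmy.ml")  # second lemmy.ml user changes nothing
funny.subscribe("carol", "sh.itjust.works")

upvote_calls = funny.deliveries(origin="sh.itjust.works")
for sender, receiver in upvote_calls:
    print(f"{sender} -> {receiver}")
```

With two subscribed instances, one upvote from sh.itjust.works results in three calls: one to the host and one from the host to each subscribed instance.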
It’s not wrong; we just have opposite ideas here.
The 50 million is based on the formula for a full-mesh network, where all instances talk to each other. In the case of Lemmy, this would be an absolute worst-case scenario, where every instance is subscribed to a community on every other instance.
In your example of only 10,000 messages, you are assuming that the 10,000 instances in existence are ONLY looking at a single community, on a single server.
Let’s say those 10,000 instances all decide to look at a community on another server. Now you have 20,000 connections.
Let’s add another community, hosted on yet another instance. That is 30,000 connections.
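To put rough numbers on both ends of that range, here is a small sketch. It assumes each hosting instance has one community that every other instance subscribes to, and it counts each pair of instances once. The function name is made up for illustration.

```python
def instance_pairs(total_instances: int, hosting_instances: int) -> int:
    """Distinct instance pairs that exchange traffic when every instance
    subscribes to one community on each of the hosting instances."""
    n, k = total_instances, hosting_instances
    # Each host talks to the n-1 others; pairs made of two hosts
    # would otherwise be counted twice, so subtract them once.
    return k * (n - 1) - k * (k - 1) // 2


for k in (1, 2, 3, 10_000):
    print(k, instance_pairs(10_000, k))
# 1 9999
# 2 19997
# 3 29994
# 10000 49995000
```

The counts come out slightly under the round 10,000 / 20,000 / 30,000 figures because a host does not connect to itself, and they reach the full-mesh figure only when every instance hosts a community that everyone else subscribes to.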
TL;DR:
My example is based on the worst-case scenario (a pretty unachievable one at that!).
Your example is based on the best-case scenario.
Realistically, the actual outcome would be somewhere much closer to the best-case scenario (as communities seem to lump up on the big servers). However, when planning architecture, you always assume the worst-case scenario.
No - you said:
Each comment. Each vote. Each post will have to be sent 50 million separate times.
That won’t ever happen. Unless there’s 50 million instances. That’s not worst case, it’s just not a case.
There is no case in the current implementation where any one action is replicated more times than there are total instances.
And it doesn’t matter what “model” you assume, each action will have to federate to each instance eventually. That count is, at minimum, the total number of instances.
Let’s say those 10,000 instances all decide to look at a community on another server. Now you have 20,000 connections.
Looking does nothing; each instance essentially hosts a copy of the “host instance” for each community. Only interactions (comments, likes, posts, etc.) are federated.
For fuck’s sake, dude, be collaborative, not defensive. This isn’t Reddit, I am not out to attack your karma.
If every instance hosts a community, and every other instance subscribes to every one of those communities, that would lead to a full mesh between all instances, resulting in the worst-case scenario, i.e., following the formula I provided for a full-mesh topology.
That is indeed the worst-case scenario that I have provided, explained, and documented in my examples.
In no way is the person you’re responding to speaking defensively. They’ve discussed the reason why your extrapolation to a full-mesh connective worst-case scenario isn’t based in the reality of how ActivityPub functions. But you don’t seem to be willing to entertain the notion that the federation of any given action never exceeds the number of instances subscribed to the community that generated it.
Even should every instance subscribe to every community on every other instance, the recipient of a federated action doesn’t turn around and rebroadcast that action back on to the network because it is not the authoritative host of that community. Therefore what this discussion is lacking is proof of where this exponential broadcast storm of federated actions comes from in your assertion.
Yes, it is a “full mesh” diagram. But for each specific “federated” action, it is a simple hub and spoke distribution. The hosting server will send the federated action to each subscribed node. The nodes don’t need to check in with each other for that specific action.
I too believe that federation is going to have scaling issues. But not due to full mesh.
I am on board with you there.
But would you not agree that delegating and offloading those federation actions to a dedicated pool of servers would assist scalability?
That way, each instance doesn’t need to maintain all of the connections?
There is no need to “maintain all of the connections”. The server opens a connection, sends the data, then closes the connection.
I realize that…
Let’s set the record straight here.
Do you think the current implementation of federation works well?
Apologies if I came off as hostile.
I mean I get what you’re saying - I just don’t see the practical use. The centralized hub replication servers would have to basically foot a huge bill for the fediverse, and do so silently and invisibly to the end user. As it is, most instances run on goodwill or donations. A silent, invisible server is hard to gather donations for. Who would run them?
Furthermore the topology you propose is essentially what we already have. A few large instances hold most of the largest communities. I don’t see that changing. This brings a fairly good balance: smaller instances pretty much only have to listen for updates from a few other instances, and only the big instances are doing the hard work of notifying hundreds of others. They are already our “hubs”. Small instances do hardly any hard work; the one I run, for example, just listens to maybe a dozen instances sending updates, and occasionally sends out an update when one of my users interacts.
I suppose I just don’t understand how this could be implemented in practice, or rather how it could be useful to do so. It would strictly enforce a sort of centralization that right now is only a natural consequence of user behavior, while seemingly only bringing theoretical benefits.
The centralized hub replication servers would have to basically foot a huge bill for the fediverse, and do so silently and invisibly to the end user.
One consideration: since they basically only have to handle pub/sub, the load might actually be drastically lower than expected.
Furthermore the topology you propose is essentially what we already have. A few large instances hold most of the largest communities. I don’t see that changing.
I suppose that is a valid point. The issue, though, is that those large instances are unable to keep up with demand and load, causing lots of federation issues.
Perhaps my idea actually wouldn’t help that at all, but, using lemmy.ml as an example:
Instead of having to send all of its updates out to every subscribed server, it can delegate that to a hub server. The hub server can run a very minimal set of instructions, with just enough intelligence to handle pub/sub.
Perhaps one idea is, instead of thinking of it as a hub server, to think of it as a proxy server: being able to delegate your instance’s actions to the proxy server takes that load off the main server.
And instead of the hubs/proxies being more centralized, perhaps it’s just an optional thing which you CAN do.
My line of thinking is about ways to reduce load on the main servers. This might be an idea that only benefits the handful of big servers.
I am not certain about the scenarios you mentioned above, but I do agree that separating the software into an instance plus a hub/proxy/message queue could help with handling load.
How can we scale our big instances? I don’t know, maybe it is easy to put an instance on multiple servers, but it sounds to me like they are just buying a bigger one, and that will fill up fast if growth continues.
I would like to hear what the developers think, but thank you for starting a conversation about scaling.
Activities aren’t sent on every “connection” in the network in the current model. There isn’t indirect transmission nor polling so even though there’s a theoretical 50 mil connections in the scenario you gave, any one activity will already only be sent up to 10k times. That’s why instances require TLS and being internet accessible, so they can receive direct communication. I agree with you that there’s some difficult scaling issues with federation but your representation of it is inaccurate.
The same problem can also be solved with signed messages, like the HTTP Signatures used by Mastodon and most of the other microblogging fedi servers. Signatures allow a message to flow peer-to-peer instead of requiring a direct connection. You would only need a connection when actively interacting with a post on another instance, and it’s very unlikely that all 10K instances would be interacting with each other. Most likely, the network will consist of smallish groups of loosely-related instances plus a few giant servers that can handle the load of being popular.
That, honestly, wouldn’t be a bad idea either. That should in theory help break up a lot of the load which is currently overly centralized.
The implementation should be a lot easier than my proposed idea as well, and it also has the side effect of potentially improving security.
Other people in the thread have already made this point: even with a full mesh network, the number of remote calls made for a single activity is equal to the number of instances subscribing to that activity (plus one if the activity originates from an instance that’s not the host of the activity).
A hub/spoke model doesn’t change this, it just moves the load from the host instance to the hub. The number of connections is still the same: if N instances need to receive the activity, N calls will have to be made. If anything this adds 1 more call from the host instance to the hub.
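As a rough illustration of that point, the sketch below counts who sends what for a single activity, with and without a hub. The function names and the figure of 9,999 subscribed instances are assumptions for the example only.

```python
def direct_calls(subscribers: int) -> dict[str, int]:
    """The host notifies every subscribed instance itself."""
    return {"host": subscribers}


def hub_calls(subscribers: int) -> dict[str, int]:
    """The host hands the activity to a hub, which notifies everyone."""
    return {"host": 1, "hub": subscribers}


n = 9_999  # assumed number of subscribed instances

direct = direct_calls(n)
via_hub = hub_calls(n)

print(direct, sum(direct.values()))    # {'host': 9999} 9999
print(via_hub, sum(via_hub.values()))  # {'host': 1, 'hub': 9999} 10000
```

The host's own share drops from N calls to one, but the total goes up by one and the hub inherits the N calls.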
Even peer-to-peer distribution of activities, mentioned by @hazelnoot@beehaw.org, wouldn’t actually change the number of calls being made. You still have N servers that have to receive the activity, so you need at least N calls overall. What this would do is redistribute the load better over instances, so the host doesn’t have to make all N calls. It would definitely be an improvement, but it would not be easy to implement successfully, and it would almost surely break ActivityPub compatibility.
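One way to picture that redistribution is a relay tree, where every server that receives the activity forwards it to a few others. This is a hypothetical sketch, not something Lemmy or ActivityPub does today, and the fan-out of 10 is an arbitrary assumed value.

```python
from collections import Counter, deque


def relay_tree_calls(receivers: list[str], fanout: int) -> Counter:
    """Count calls per sender when each server forwards the activity to at
    most `fanout` servers that have not received it yet (fanout >= 1)."""
    calls: Counter = Counter()
    pending = deque(receivers)   # servers still waiting for the activity
    senders = deque(["host"])    # servers that already have it
    while pending:
        sender = senders.popleft()
        for _ in range(fanout):
            if not pending:
                break
            target = pending.popleft()
            calls[sender] += 1
            senders.append(target)  # target can now forward it onwards
    return calls


receivers = [f"instance-{i}" for i in range(1000)]
calls = relay_tree_calls(receivers, fanout=10)

print(sum(calls.values()))  # 1000 calls in total, same as direct delivery
print(max(calls.values()))  # 10 calls at most for any single server
print(calls["host"])        # 10 calls made by the host
```

The total stays at N, but no single server makes more than `fanout` calls. Verifying that a relayed activity really came from the host is left out of this sketch entirely.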
The only thing I can think of that would actually reduce the overall network load, though, is batching: sending multiple activities/updates together in a single message. AFAIK this is not supported by ActivityPub, though, so implementing it would mean breaking compatibility, and also implementing an entirely updated version of the protocol (which is a massive undertaking).
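A small sketch of what batching would save, under the assumption that activities for the same destination can be grouped into one message. The batch size of 20 and the helper names are invented for the example, and nothing here is part of ActivityPub.

```python
def unbatched(activities: list[str], subscribers: list[str]) -> list[tuple[str, list[str]]]:
    """One message per activity per subscribed instance."""
    return [(dest, [activity]) for activity in activities for dest in subscribers]


def batched(activities: list[str], subscribers: list[str], batch_size: int) -> list[tuple[str, list[str]]]:
    """Group up to `batch_size` activities into each message per destination."""
    messages = []
    for dest in subscribers:
        for start in range(0, len(activities), batch_size):
            messages.append((dest, activities[start:start + batch_size]))
    return messages


votes = [f"vote-{i}" for i in range(50)]
subscribers = ["lemmy.ml", "sh.itjust.works", "beehaw.org"]

plain = unbatched(votes, subscribers)
grouped = batched(votes, subscribers, batch_size=20)

print(len(plain))    # 150 messages
print(len(grouped))  # 9 messages
```

Every destination still receives all 50 votes; only the number of separate messages changes.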
My logic was to move the load away from the primary instance server, onto a service/server that only focuses on handling federation duties.
My reasoning is to break apart the two workloads, and hopefully build a more scalable federation tier that can scale independently of the primary instance server.
I understand the logic, and you’re right to think about how to improve Lemmy’s scalability. But I’m not sure if this is the way to go.
If you build a dedicated federation proxy for an instance, you’ve really just slightly moved the problem. The federation proxy is going to have the same scalability issues, and if anything the total load goes up.
If you build multi-instance hubs, you suddenly introduce a lot of new issues.
- Security: I think Lemmy checks the source of an update to verify that it comes from the legitimate host. You would have to introduce some kind of signatures to verify that the activity originated from the legitimate host.
- Privacy: now your users have to trust the hub owners with their data, not just the instance.
- Motive: who would be running the hubs, and why? They would have to be even bigger than the instances, and there would be much less incentive to do it.
Nice, but it kinda breaks the point of federation. Who’s running the hubs? If it’s a company, then nice, we’re back to appeasing our corporate overlord.
If it’s a secondary federated system, nice, now you’re just needlessly complicating things, as anyone could create a hub, and you could end up with a lot of single-spoked hubs.
Load-wise, having the hub separated onto its own server would make scaling easier. So even a one-hub, one-instance solution could work for large instances. For personal instances this solution would be nice, because they could share one hub and federate through that.
No one is suggesting here to have any company host the hub.
I know no one is suggesting that - that’s not what I was saying.
But it also would cost more, and also someone’s got to host it and who is that person? Or is it a collective decision to contribute towards hosting the hub?
Who gets the rights? Whoever has control of the hub naturally has direct control over their part of the threadiverse.
While it adds additional load bearing capabilities, it also adds another point of failure for those sites, and potentially even sabotage if disagreement happens.
After all, someone’s gotta put their name down to hire or buy server capacity for it, even if everyone pays for it, and that someone has full control over what goes through the hub.
Unless we encrypt all data so that the hub is just a dumb relay, that’s not going to work. And even then - you can still tell where the messages come from and slow them or block them at the hub.
The point of the fediverse is to make it so regular Joes can afford to run servers on just their own income or donations, so we can take the corporate out of it. We don’t want to add too much additional cost, and adding a hub system will do that.
Is this accurate on how it works? My assumption was a user would have to be subscribed to a remote community on their local instance for that local instance to pull posts/votes/comments from the remote instance. It’s not like everything is replicated everywhere.
Your assumption is correct.
I gave the worst-case scenario for modeling purposes.
Realistically, the number of connections will be far less. However, do also note that this platform will soon be hosting over one million users. Everything is going to scale upwards.
In your proposal, who would run these hub servers?
Theoretically, anyone could run a hub server.
A hub server would work just like an instance does in the current state, and to keep things decentralized, I would recommend it stay that way.
However, I do believe some controls would be needed to prevent EVERYONE from creating their own hub server, leading to the current issue of instantly large federation loads.
Perhaps, just a simple check to prevent a hub from running with less than adequate resources. Or, perhaps, we as a community, can collectively decide who/where/what the hub servers are.
I don’t have all of the answers to my question/proposal. I just know the full-mesh topology is NOT going to scale, and we do need to work together to find a better solution to this problem.
What if each instance had a message broker distribute updates in a topic-oriented pub/sub fashion? Does the ActivityPub spec specify that instance X must HTTP POST updates to instance Y, or is there room for implementations to get creative?
Does the ActivityPub spec specify that instance X must HTTP POST updates to instance Y, or is there room for implementations to get creative?
For my initial idea, my proposal is to allow servers to have the option to choose to go through a hub, or to function as-is.
What if each instance had a message broker distribute updates in a topic-oriented pub/sub fashion?
That would help the current implementation pretty drastically, although I am not 100% sure how it is done currently.
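To make the broker idea a bit more concrete, here is a minimal in-memory sketch with one topic per community. It is only an illustration: the `Broker` class, the topic naming, and the inbox lists are all made up, a real deployment would presumably use an external message broker, and whether ActivityPub allows delivery this way is a separate question.

```python
from collections import defaultdict
from typing import Callable


class Broker:
    """Hypothetical in-memory topic broker, one topic per community."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, activity: str) -> int:
        """Hand the activity to every subscriber of the topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(activity)
        return len(callbacks)


# Each remote instance is represented by a plain list acting as its inbox.
inboxes: dict[str, list[str]] = {"lemmy.ml": [], "sh.itjust.works": []}

broker = Broker()
for inbox in inboxes.values():
    broker.subscribe("funny@lemmy.world", inbox.append)

delivered = broker.publish("funny@lemmy.world", "upvote on post 1")
ignored = broker.publish("news@lemmy.world", "new post")  # nobody subscribed

print(delivered)            # 2
print(ignored)              # 0
print(inboxes["lemmy.ml"])  # ['upvote on post 1']
```

The publishing instance only talks to the broker, which is the part of the load this approach would move off the instance.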
I’ll be the odd one out and say I support this model but for other reasons than the technical limitations and scaling problems involved. For me it’s more about trying to establish a tighter ring of trust and enable easier user onboarding as the hub could serve as the primary identity store for users on multiple instances.
I mentioned it in some chat earlier, but I think that the Beehaw.org moderation model, goals, and philosophy serve as an excellent starting point for like-minded communities to build out the hub-and-spoke. It would also give them greater flexibility in maintaining the health of their corner of the fediverse by centralizing identity with them.
This model would, of course, not stop others from creating their own hub and spoke and would break apart the fediverse a bit, so I suppose there should be a way for “hubs” to talk to each other in a way that resembles what we have now.
From a blocking-bad-actors standpoint (I’m still upset about Captcha getting removed even if it’s a technically inferior solution), it would be far easier to have fewer hubs to blacklist/whitelist than having to do it for each individual instance.
I guess to go a bit further, if Lemmy could support both “modes” (as in it can be configured to be hub and spoke as either the hub or spoke, as well as retain the existing functionality for those who don’t want a hub) that would be ideal.
A bit of centralized spam management wouldn’t be a bad idea at all.
To be honest, I don’t know anything about how Lemmy works. However, I was wondering whether there is anything that can be learnt from the Nano cryptocurrency, because they have worked really hard to reduce traffic and spam on their network. It’s worth looking into if you haven’t already. There is no profit to be made with Nano, so the community is much less crypto-bro than other cryptos.