-
Notifications
You must be signed in to change notification settings - Fork 16
Add initial ActivityPub support to starter packs #177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Oh yeah, and I forgot to mention: I'm kinda 50/50 on whether this can serve as a general API (for people who don't care about ActivityPub) or if a separate JSON API should exist to list starter packs, like you worked on in the separate branch. This version is simple to fetch and read, assuming the client can set a custom One downside of this schema is that the ActivityPub IDs are all you get. If the client wants to convert those back into account handles, that's another HTTP round trip each. Unfortunately there really isn't a good way to add the handles to this payload without massively increasing the schema's complexity. So maybe that means you ultimately want a separate non-ActivityPub API after all? Maybe we should wait and see what API consumers say when this is live. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far but I left a few comments. It would also be great if we had some tests for the wats_activitypub branch!
| # This should be resolved somewhere else by a background task, so we | ||
| # return an HTTP 500 error and leave it to the client to try again later. | ||
| return HttpResponseServerError() | ||
| items = [account.activitypub_id for account in accounts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't items be dicts like this:
{
"type": "Person",
"id": "https://example.org/users/username",
"name": "Display Name"
}
Having them as dict has the benefit that we can add new properties without making a backwards incompatible change. If we go with list of strings it won't be easy to change in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ActivityPub uses JSON-LD, so serving the URI https://example.org/users/username or (some subset including id of the properties of) the object hosted at https://example.org/users/username is equivalent. Referencing the ID is just more convenient for us for two reasons:
- Since we're not the authoritative source for
https://example.org/users/username(that would be the server atexample.org), we can't really add or change properties here. They'd have to be synced to whatever exists athttps://example.org/users/username. Outdated cached values are not rare, but intentional deviations would be invalid, and a JSON-LD parser would likely strip them out. - If we rehost
https://example.org/users/username's info here, we need to keep it stored and updated. For example, we don't currently store thetype. We could add it to the database like we did with the ID, but unlike the ID it can change at any time, for example if a Mastodon user checks or unchecks the "this account is a bot" toggle. That sounds like effort. - If we say "screw the rules" and add our own info to our local version of
https://example.org/users/username, we'd still need to stay within the ActivityStreams vocabulary, or add our own resolvable JSON-LD@contextwhich sounds like even more of a pain.
I see only downsides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with ActivityPub or JSON-LD but in my experience creating an API endpoint with a list of strings is a bad idea since it's impossible to extend without creating a breaking change.
Also, I'm not sure how useful this endpoint will be if every consumer has to then also refetch all the related objects. Feels wasteful.
I'm already refreshing account data somewhat regularly and stale data hasn't been a big issue so far.
My preference would be to use a list of dicts, even if we start with:
{
"id": "https://example.org/users/username"
}
Just so that we keep the door open for future extension if it will ever be needed.
I don't feel very strongly about this though so if you believe that having a list of strings is better then let's go with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm gonna try my best to set my implementation biases aside and answer from a perspective that mainly takes ActivityPub peers and API clients into account. I think it'll be useful for us to get a bit more shared clarity on this.
I'm not very familiar with ActivityPub or JSON-LD but in my experience creating an API endpoint with a list of strings is a bad idea since it's impossible to extend without creating a breaking change.
I don't know if "a list of strings" is a useful way to discuss this approach. In JSON-LD, the ID string and the object are quite literally interchangeable, so any consumer of this data needs to accomodate both versions anyway, including switching back and forth between IDs and inlined objects within the same list. It's almost pointless to discuss which one is more convenient for consumers because they need to be able to handle both, regardless of whether they're using a full JSON-LD parser (Fedify, Iceschrimp) or not (Mastodon, Pixelfed). That is to say, it would not be a breaking change to switch between IDs and inlined objects at a later date if we so choose. In fact, it would be a huge mistake for us to guarantee a particular shape of the data structure here.
I'm a tiny bit tempted to randomly replace one in ten IDs with its inlined object counterpart, just to keep ActivityPub consumers on their toes. But I'm not enough of a troll to do that in production software. 🙃
And regarding extensibility, we have very little room to add anything to the contained objects anyway. Because they're owned by other servers, all that we could theoretically add to them would be their properties as they exist on the source server. It's not that I can't see that being useful (e.g. to efficiently compile a list of avatar URLs directly from a starter pack), but I get the impression that you have a perspective on API extensibility here that doesn't map to ActivityPub.
Also, I'm not sure how useful this endpoint will be if every consumer has to then also refetch all the related objects. Feels wasteful.
If they want to do anything beyond trivial (e.g. check how many entries are in the list) with the starter pack, they need to fetch every listed object from the source servers anyway, since – setting unintentional inaccuracies caused by caching aside – they can't necessarily trust our list to be non-malicious. Imagine if a starter pack hosted on fedidevs.com were allowed to provide an object representation of https://example.org/users/abc and the consuming server (say, Pixelfed) just took that at face value. Any starter pack out there would be able to set your name, bio, avatar, etc. to whatever it wants. Since I started using Mastodon, it has had a couple of emergency security updates related to spots where they forgot to validate inlined objects from federated ActivityStreams data against the source instance.
The origin-based data ownership model is documented in FEP-fe34. Not a must-read, just some further info I'm linking in case you're curious.
I'm already refreshing account data somewhat regularly and stale data hasn't been a big issue so far.
That's good to know! 👍 I'll scratch that concern from my mental to do list.
With all this in mind, I'm leaning more and more towards the idea that a separate JSON API for consumers who don't care about ActivityPub may very well make sense. But that's outside the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the two formats are interchangeable then let's go with the object one. If we end up adding extra metadata into the object for non ActivityPub clients then that metadata will be ignored by ActivityPub clients since they'll be refetching from source so no harm done, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Your stance matches my gut feeling.
Having a wrapper type for entries would break compatibility with Dan's schema (pixelfed/starter-kits/issues/1), which is AFAIK implemented by Pixelfed and WordPress ActivityPub. So I don't think that's something we want to do here.
We could plop our metadata into a separate property of the root Collection, which we do own. Although at that point it becomes a little dubious again whether consumers of AS and those wanting plain JSON should really share a payload or whether we should just split them entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun times 😆
What's the type value supposed to contain and where can we get this data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could plop our metadata into a separate property of the root Collection, which we do own.
@jfietkau Another option is to offer two different collections: Pixelfed / legacy collection and your own collection that can contain additional metadata (sounds like an important feature!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of type can in theory be pretty much whatever, although in the wild I've only seen these five. It differs per account and can change over time.
Unlike the ID, the type is sadly not in the Mastodon API account lookup that we already do. The process to acquire it would be to fetch the ActivityPub ID as a URL with the correct Accept header to get the ActivityPub representation. The type field is in there and can be stored as a string.
Or we split the ActivityPub and JSON APIs. In that case we could put Mastodon's bot boolean that we already have in the database into the JSON result. I think I'd prefer to move this PR in that direction rather than abuse the ActivityPub objects too much. I'll whip something up for you to look at tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@silverpill True, that would be possible too. The new collection would need a separate ID/URI though, which is a bigger change in this implementation context than I wanted to introduce. Will mull it over.
("Pixelfed / legacy collection" – shots fired 😆)
| "id": request.build_absolute_uri("/s/" + starter_pack.slug + "/"), | ||
| "name": re.sub(r":\w+:", "", starter_pack.title).strip(), | ||
| "summary": starter_pack.description, | ||
| "attributedTo": author.activitypub_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above and ties into your proposal. Should attributedTo also be a dict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reply as above. I don't think we'd gain anything from it.
| if author.activitypub_id is None: | ||
| # Owner of this starter pack does not have an ActivityPub ID stored yet. | ||
| # This should be resolved somewhere else by a background task, so we | ||
| # return an HTTP 500 error and leave it to the client to try again later. | ||
| return HttpResponseServerError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created the script to populate activitypub_id but and it's now running (will probably take more than a day to go over all accounts).
What I'm seeing is that there are still going to be a lot of accounts where activitypub_id won't be set because a lot of webfinger requests are returning 410, 404 or 403 responses.
Not returning anything when fingerprint id isn't set might not be the best. Instead maybe we can still return results, especially if items and attributedTo are dicts and we can populate the name field...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accounts without working WebFinger resolution should be a niche case. They are valid ActivityPub, but they can't interoperate with Mastodon. Occasionally someone misconfigures their server in some way (and I will say that my WebFinger resolution function from #175 doesn't take host-meta into account, which probably affects a few servers), but the more likely explanation especially for the 410 errors is that the accounts don't exist anymore. Could you send me a list of a few non-working accounts? I'll see what I need to do to update the script.
We absolutely, 100%, cannot return anything about the account if we don't have the ID. We could at best leave out the attributedTo field altogether for that case, but that seems like a data integrity issue. Same for a partial list in items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated my WebFinger backfilling script in #175 to also handle alternative WebFinger domains from host-meta information. Should be a fringe configuration, but I'm aware that a few of them exist in the wild. Differences are only in the body of the get_activitypub_id_from_webfinger function, but it changes how the WebFinger URL is constructed, so I had to adjust more than just the first few lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After running the original script 5% of all accounts don't have activitypub_id.
Since the vast majority of these are 410 responses I think I can assume that these accounts are no longer active and I can remove them from the database.
There is a handful that received a 403 response (mostly from awscommunity.social).
Here is a small sample with different error types:
[email protected] Error: status code 403 for WebFinger data
[email protected] Error: status code 403 for WebFinger data
[email protected] Error: status code 403 for WebFinger data
[email protected] Error: Failed to fetch WebFinger data
[email protected] Error: Failed to fetch WebFinger data
[email protected] Error: status code 410 for WebFinger data
[email protected] Error: status code 410 for WebFinger data
[email protected] Error: status code 410 for WebFinger data
[email protected] Error: status code 410 for WebFinger data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, awscommunity.social is gone. It used to be a Mastodon instance I think, but I remember reading about them shutting down.
emacs.ch is also down, but I don't know if that one's temporary.
The other four, with the 410 errors, are deleted accounts from working instances. It's worth noting that their profile URLs all return 410 as well, so they're possible to detect.
I can't find the spot where we do scheduled account updates. Do routine updates use accounts/management/commands/crawler.py? If so, I guess we might not be detecting deleted accounts at all yet. A 410 error is a very safe bet that the account can be removed from the database. A 404 should probably only be taken as such if it lasts for a long time – accidental 404s could pop up during certain kinds of server maintenance.
It could be worthwhile to check https://{instance}/health from time to time for each server and remove (or hide?) these instances and their accounts after prolonged downtime. I believe Mastodon gives up on activity delivery after two weeks.
I'd like to not make all that a prerequisite for this PR though. It's worth investigating separately, but for the purpose of ActivityPub delivery, I think the current way the code handles missing IDs is the correct one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do routine updates use accounts/management/commands/crawler.py? If so, I guess we might not be detecting deleted accounts at all yet.
That's it, yes! We aren't detecting deletions at all. So yeas, all those 410 can safely be deleted.
It could be worthwhile to check https://{instance}/health from time to time for each server and remove (or hide?) these instances and their accounts after prolonged downtime. I believe Mastodon gives up on activity delivery after two weeks.
Good idea!
None of these are prerequisite for this PR. All good 👍
|
Thanks! I'll take a look at the tests later today. |
| # This should be resolved somewhere else by a background task, so we | ||
| # return an HTTP 500 error and leave it to the client to try again later. | ||
| return HttpResponseServerError() | ||
| items = [account.activitypub_id for account in accounts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ActivityPub uses JSON-LD, so serving the URI https://example.org/users/username or (some subset including id of the properties of) the object hosted at https://example.org/users/username is equivalent. Referencing the ID is just more convenient for us for two reasons:
- Since we're not the authoritative source for
https://example.org/users/username(that would be the server atexample.org), we can't really add or change properties here. They'd have to be synced to whatever exists athttps://example.org/users/username. Outdated cached values are not rare, but intentional deviations would be invalid, and a JSON-LD parser would likely strip them out. - If we rehost
https://example.org/users/username's info here, we need to keep it stored and updated. For example, we don't currently store thetype. We could add it to the database like we did with the ID, but unlike the ID it can change at any time, for example if a Mastodon user checks or unchecks the "this account is a bot" toggle. That sounds like effort. - If we say "screw the rules" and add our own info to our local version of
https://example.org/users/username, we'd still need to stay within the ActivityStreams vocabulary, or add our own resolvable JSON-LD@contextwhich sounds like even more of a pain.
I see only downsides.
| "id": request.build_absolute_uri("/s/" + starter_pack.slug + "/"), | ||
| "name": re.sub(r":\w+:", "", starter_pack.title).strip(), | ||
| "summary": starter_pack.description, | ||
| "attributedTo": author.activitypub_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reply as above. I don't think we'd gain anything from it.
| if author.activitypub_id is None: | ||
| # Owner of this starter pack does not have an ActivityPub ID stored yet. | ||
| # This should be resolved somewhere else by a background task, so we | ||
| # return an HTTP 500 error and leave it to the client to try again later. | ||
| return HttpResponseServerError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accounts without working WebFinger resolution should be a niche case. They are valid ActivityPub, but they can't interoperate with Mastodon. Occasionally someone misconfigures their server in some way (and I will say that my WebFinger resolution function from #175 doesn't take host-meta into account, which probably affects a few servers), but the more likely explanation especially for the 410 errors is that the accounts don't exist anymore. Could you send me a list of a few non-working accounts? I'll see what I need to do to update the script.
We absolutely, 100%, cannot return anything about the account if we don't have the ID. We could at best leave out the attributedTo field altogether for that case, but that seems like a data integrity issue. Same for a partial list in items.
| if author.activitypub_id is None: | ||
| # Owner of this starter pack does not have an ActivityPub ID stored yet. | ||
| # This should be resolved somewhere else by a background task, so we | ||
| # return an HTTP 500 error and leave it to the client to try again later. | ||
| return HttpResponseServerError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated my WebFinger backfilling script in #175 to also handle alternative WebFinger domains from host-meta information. Should be a fringe configuration, but I'm aware that a few of them exist in the wild. Differences are only in the body of the get_activitypub_id_from_webfinger function, but it changes how the WebFinger URL is constructed, so I had to adjust more than just the first few lines.
Co-authored-by: Anže Pečar <[email protected]>
for more information, see https://pre-commit.ci
… string representation Co-authored-by: Anže Pečar <[email protected]>
for more information, see https://pre-commit.ci
|
I have added a few tests, for the HTTP content negotation and for the JSON-LD schema. And what do you know, they uncovered a few improvements to how I'm doing the content negotation and have made it more resilient in edge cases. The only problem is with the test setup requirements that the user creating the starter pack must have a full account, and that all accounts (the creator and the five "inhabitants") must all have ActivityPub IDs. I keep getting this error: So the setup for the ActivityPub test is still flat-out missing right now. I put some time into trying to learn Model Bakery to make it happen, but didn't feel like I was making much progress. Is this something you could knock out in a few minutes or should I try again tomorrow? But hey, if I patch the requirements for a full author account and for ActivityPub IDs out of the view code, the tests succeed. So that's something. |
Great! We can definitively make the test have all the necessary data! I'll take a look at this PR tomorrow, it's getting late here so I need to go to bed now 😅 |
|
Okay, a new revision of this PR is in. I have taken a shot at separating the JSON and ActivityPub APIs. There's arguably a little bit of duplication, but while putting this together I got the feeling that the results flow much more naturally than trying to cram a general-use JSON API into the ActivityPub format. Let me know what you think. I'm not fussy about what exactly should be in the JSON result. The way I did it now was to pick a few handfuls of model properties that seemed to make sense to expose, but I'm fairly dispassionate about the JSON API. I'm probably not going to be among its users. At this stage, this PR is more about the general structure of the code and the API separation. While I was at it, I added the ability to request a specific format via a GET parameter, so you can do I've also updated the tests. They're still missing their setup, but the JSON API has preliminary validation. |
|
Whoops. I wanted to push my fixes to your branch but I ended up pushing to the main branch instead 😅 Well the good news is that the changes are now live and you can test them out on fedidevs.com (example: https://fedidevs.com/s/NjA3/?format=activitypub)! If you still need to make changes to the endpoint then feel free to open a new PR. |
|
Thanks for merging (and cleaning up) @anze3db! 🙂 Looks all good from my vantage. I'll take partial responsibility for the main branch thing. For #178 I was working on a separate branch (by necessity, since this one was ongoing at the same time), so that may have inadvertently set some expectations. Glad you got everything sorted out quickly anyway! Before I go around making noise about the new ActivityPub support, I'd like to add per-pack splash images. I'll open an issue for that. |
Building on #175, this pull request adds ActivityPub support to starter packs. This is not intended to be ready to merge right away, there are several spots in here that I want to discuss. Plus it'll only make sense to actually run in practice once the ActivityPub backfilling from the post-merge discussion in #175 is in place. But we can already start talking about it. And if you want to merge it early, it does gracefully handle missing ActivityPub IDs, so the only hard depencency is the stuff from #175 that's already merged.
About the design of what's here:
I took a "keep it simple" approach under the assumption that there is currently nothing else in Fedidevs that we want to add ActivityPub support to. So the changes here are all contained in
starter_packs/views.pyeven though thewants_activitypubfunction, which detects HTTP requests from ActivityPub platforms, is fit for the general case. If we ever do more ActivityPub in various places, we should find a better spot for it. Until then I reckon it makes sense to keep it all close by.In this implementation, we use content negotiation to serve ActivityPub under the same URL as the HTML version. I've read some Django docs saying that the view is the correct place to do this. There are other options besides content negotiation, such as
LinkHTTP headers or HTML elements. There is a draft document by the SWICG that gives an overview. But again, I believe this is the simplest approach, and also coincidentally the one with the best compatibility in the ecosystem.Because we serve only a single ActivityPub schema, it makes more sense to construct it directly from a
dictthan to pull in dependencies related to JSON-LD. One last time: keeping it simple.The schema itself is the one from #pixelfed/starter-kits/issues/1 with some of my own suggested additions. But we should be compatible with it as well as with the ActivityStreams base vocabulary.
I'm going to try to add a couple of line comments to highlight spots where I wasn't 100% sure.