How does Lemmy work with search engines?

melonpunk@lemmy.world · 2 years ago

How does Lemmy work with search engines?

wpuckering@lm.williampuckering.com · edited 2 years ago

There are a lot of things that factor into the answer, but I think overall it’s gonna be pretty random. Some instances are on domains without “Lemmy” in the name, some don’t include “Lemmy” in the site name configuration, and in the case of some like my own instance, I set the X-Robots-Tag response header such that search engines that properly honor the header won’t crawl or index content on my instance. I’ve actually taken things a step further with mine and put all public paths except for the API endpoints behind authentication. Lemmy clients and federation still work with it, but for extra privacy you can’t browse my instance content without going through a proper client. But that goes off-topic.
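For anyone wondering what setting that header might look like, here is a minimal Nginx sketch. The post doesn’t show the exact value used, so the "noindex, nofollow" value, the hostname, and the backend address are all assumptions for illustration only.

```nginx
# Hypothetical fragment; not the poster's actual configuration.
server {
  listen 80;
  server_name lemmy.example.org;   # placeholder hostname

  # Ask well-behaved crawlers not to index pages or follow links.
  # "always" attaches the header to every response code,
  # not only the successful and redirect ones.
  add_header X-Robots-Tag "noindex, nofollow" always;

  location / {
    proxy_pass http://127.0.0.1:1234;   # placeholder backend address
  }
}
```

One thing to watch: Nginx only inherits add_header from the enclosing level when the current level defines no add_header of its own, so a location block that adds other headers would need this line repeated.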

Reddit was centralized, so it could be optimized for SEO. Lemmy instances are individually run, with different configuration at both the infrastructure level and the application level. If most people leave things fairly vanilla, that should result in pretty good discovery of Lemmy content across those kinds of instances. But I would think most people technical enough to host their own instances would have deviated from the defaults and (hopefully) implemented some hardening, which would likely mess with SEO.

So yeah, expect it to be pretty random, but not necessarily unworkable.

OrangeSlice@lemmy.ml · 2 years ago

Easily the best answer here, I think the people who think it will work “just like Reddit” are unfamiliar with federation still, and aren’t used to thinking things through in those terms.

Not to mention that Google results in general have been pretty trash for a couple years now. I don’t expect fediverse content to be prominent for some time unless there is a dedicated service that indexes everything.

itty53@vlemmy.net · 2 years ago

I mean, why couldn’t there be a dedicated service that indexes everything? Whoever makes it and gets it working in a user-friendly manner is going to have a significant level of control over the content that is shown in the results. If you don’t want it, it isn’t indexed. I don’t have to stretch the imagination to think of parties that have good reason to want to be first to do that across ActivityPub as a whole. Mastodon is already a big frontrunner in that regard.

jmp242 · 2 years ago

I kind of feel like Kagi will be all over this with its forum ‘lens’ for search, but it’s paid. Maybe boardreader would focus on this too?

Google search isn’t as good as it used to be, and using startpage.com to break the filter bubble isn’t as effective anymore either. So we probably all need to start remembering what it was like in 1999: different search engines for different things, and looking for what works best.

melonpunk@lemmy.world · 2 years ago

Great answer, thanks.

I’m not hugely familiar with SEO, but I seem to remember there could be a penalty applied to content that is duplicated as it’s seen as spammy. I might be wrong on how this works though, and it could be based around only content pasted within a single domain.

I just wonder how search engines will deal with seeing the same content across a lot of instances in terms of ranking and noise.

Kresten@feddit.dk · 2 years ago

Why would you actively work against indexing?

wpuckering@lm.williampuckering.com · edited 2 years ago

As a general rule, I prevent all of my self-hosted services that are directly exposed to the Internet from being crawled or indexed by search engines. Any service I do expose publicly to the Internet is of course behind proper authentication and is secured using modern best practices and standards. Still, someone stumbling onto services they have no use for, and potentially trying to exploit them, is less likely to happen if those services aren’t presented front and center in a search result. I wouldn’t say it’s a proper security measure by any means (obscurity has nothing to do with real security), but blending into the crowd or taking a seat at the back of the room draws less attention to yourself if you don’t care to be the first target in someone’s sights.

So why do I expose any of my self-hosted services to the Internet in the first place, rather than access them exclusively via VPN? For me there’s a few reasons:

Ease of Access - I want the ability to instantly share usage of specific services that I host with friends and family over the Internet, and I can’t expect them to do so over VPN, even if I were to offer to help them get set up
Performance - I use Cloudflare Tunnels to expose my services (no open router ports, ever), which allows me to use Cloudflare’s CDN for caching static assets such as immutable images, CSS, and JavaScript, and I’ve extensively tweaked my Cache Rules to squeeze the most out of it
Security - Cloudflare secures my services with their built-in tooling, and I can use Cloudflare Access if I want to limit access further to specific users by means of accounts they already have, such as Google or various social media account providers

…And there are more reasons I could get into, and I could easily expand on the ones above, but I’ll leave it there.

Of course having all of my external traffic flow through Cloudflare means there’s no expectation of data privacy for any payload traversing in and out of my services, but I’ve decided that I’m okay with that for the other benefits I get out of Cloudflare. Nothing’s truly free, right?

But to answer your original question more specifically, and with the context above in mind, why actively work against indexing in the case of my Lemmy instance? Well, I’m the only user on my instance. I only use it as a home server for my account. That means I’m not creating any communities on it, and there’s no content actually originating from my instance proper. Anybody who would end up coming across my instance, if they were to browse, would see content which originates from other instances, and only content from the time that I set up my server and began federating with those other servers and onward. They wouldn’t see every comment from posts that pre-dated my federation, so it would be an incomplete view. They would be better off going directly to the server that originated the content. They could of course do that by following the permalink from my own server, but it’s an extra hop. It might arguably be better in this case if I just remove my server entirely from any possible search results, so that if the originating instance is indexable, its content shows up in the results and mine doesn’t. That would probably be a better user experience for users trying to find Lemmy content via search engines, since they’d hopefully land on the originating instance sooner rather than later.

Long answer, but I wanted to give as much insight and clarity into why I do what I do. Happy to answer any more questions!

Kresten@feddit.dk · 2 years ago

Interesting insight into your setup and thought process.

That makes good sense. I didn’t realize you hosted your instance only for yourself. I might consider that as well in the future.

wpuckering@lm.williampuckering.com · 2 years ago

It’s a good idea to host your own instance (if you can) for a number of reasons, depending on your own skillset and level of knowledge:

If you know how to best optimize web applications and backend databases, you can make your own instance more performant than others who may not have that knowledge. Case in point, I’ve seen various complaints on Lemmy about performance degradation due to an influx of users from Reddit in the past week or so, but I’ve been completely unaffected.
If other instances are down but yours is up, and you have been pulling content from and federating with them, you’ll have the latest content from those instances cached on your own. So you’ll still be able to browse whatever you’re subscribed to on those offline instances for the time being (those who use those instances directly are out of luck).
You own your own content. Updating content on your own instance results in a push to the other instances to bring them in sync. So let’s say you want to delete all of your post history or modify all of your past content to garbage. You can almost completely guarantee that no one can stop you from doing that, since doing so on your own instance will propagate those changes to all of the others it’s aware of where you’ve posted, as long as those instances are still online and reachable. And no one can pull the rug out from under you and make your account, complete with post and comment history from some point in time, inaccessible to yourself (removing this ability to secure your privacy and ownership of your own data).
Load balancing (kind of but not really). Hosting your own instance takes some load off of other instances. Those instances still bear load when you push content to them, but a sync is a smaller set of operations than doing everything directly on those instances. This results in better overall health for the Fediverse as a whole.

There’s even more reasons I’m sure, but those are the obvious ones that come to my mind.

fizzym4d@lemmy.fmhy.ml · 2 years ago

Your “off-topic” sounded pretty cool to me! I love that that is something anyone can do when hosting a lemmy instance. You get to choose if it’s searchable on the web! Obviously there are search engines which ignore the no scraping/indexing header, but the rest of what you did should counteract that, noice.

wpuckering@lm.williampuckering.com · edited 2 years ago

Yeah, if you’re running something yourself, you can do pretty much whatever you want in order to protect it. Especially if it’s behind a reverse proxy. Firewalls are great for protecting ports, but reverse proxies can be their own form of protection, and I don’t think a lot of people associate them with “protection” so much. Why expose paths (unauthenticated) that don’t need to be? For instance, in my case with my Lemmy instance, all any other instance needs is access to the /api path which I leave open. And all the other paths are behind basic authentication which I can access, so I can still use the Lemmy web interface on my own instance if I want to. But if I don’t want others browsing to my instance to see what communities have been added, or I don’t want to give someone an easy glance into what comments or posts my profile has made across all instances (for a little more privacy), then I can simply hide that behind the curtain without losing any functionality.

It’s easy to think of these things when you have relevant experience with web development, debugging web applications, full stack development, and subject matter knowledge in related areas, if you have a tendency to approach things with a security-oriented mindset. I’m not trying to sound arrogant, but honestly my professional experience has a lot to do with how my personal habits have formed around my hobbies. So I have a tendency to take things as far as I can with everything that I know, and stuff like this is the result lol. Might be totally unnecessary without much actual value, but it errs on the side of “a little more secure”, and why not, if it’s fun?

Arinshot@lemmy.world · 2 years ago

I’d be interested in how you did this, this seems like one of the best ways I’ve seen for securing a lemmy instance.

wpuckering@lm.williampuckering.com · 2 years ago

I have a single Nginx container that handles reverse proxying of all my selfhosted services, and I break every service out into its own configuration file, and use include directives to share common configuration across them. For anyone out there with Nginx experience, my Lemmy configuration file should make it fairly clear in terms of how I handle what I described above:

server {
  include ssl_common.conf;
  server_name lm.williampuckering.com;
  set $backend_client lemmy-ui:1234;
  set $backend_server lemmy-server:8536;
  
  location / {
    set $authentication "Authentication Required";
    include /etc/nginx/proxy_nocache_backend.conf;
    
    if ($http_accept = "application/activity+json") {
      set $authentication off;
      set $backend_client $backend_server;
    }
    if ($http_accept = "application/ld+json; profile=\"https://www.w3.org/ns/activitystreams\"") {
      set $authentication off;
      set $backend_client $backend_server;
    }
    if ($request_method = POST) {
      set $authentication off;
      set $backend_client $backend_server;
    }
    
    auth_basic $authentication;
    auth_basic_user_file htpasswd;
    proxy_pass http://$backend_client;
  }
  
  location ~* ^/(api|feeds|nodeinfo|\.well-known) {
    include /etc/nginx/proxy_nocache_backend.conf;
    proxy_pass http://$backend_server;
  }
  
  location ~* ^/pictrs {
    proxy_cache lemmy_cache;
    include /etc/nginx/proxy_cache_backend.conf;
    proxy_pass http://$backend_server;
  }
  
  location ~* ^/static {
    proxy_cache lemmy_cache;
    include /etc/nginx/proxy_cache_backend.conf;
    proxy_pass http://$backend_client;
  }
  
  location ~* ^/css {
    proxy_cache lemmy_cache;
    include /etc/nginx/proxy_cache_backend.conf;
    proxy_pass http://$backend_client;
  }
}

It’s definitely in need of some clean-up (for instance, there’s no need for multiple location blocks that do the same thing for caching; a single expression can handle all of the ones with identical configuration to reduce the number of lines required), but I’ve been a bit too lazy to clean things up. However, it should serve as a good example and communicate the general idea of what I’m doing.
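As a rough sketch of the clean-up described above (an illustration, not the poster’s actual revised file): the /static and /css blocks in the posted config are identical, so one regex location could cover both. These fragments are meant to sit inside the same server block and reuse its $backend_client and $backend_server variables, its lemmy_cache zone, and its include file.

```nginx
# Sketch only: these blocks replace the three caching locations
# inside the server block shown above.

# /static and /css share the same cache and backend, so one
# alternation in the regex handles both paths.
location ~* ^/(static|css) {
  proxy_cache lemmy_cache;
  include /etc/nginx/proxy_cache_backend.conf;
  proxy_pass http://$backend_client;
}

# /pictrs stays on its own because it proxies to $backend_server
# rather than $backend_client.
location ~* ^/pictrs {
  proxy_cache lemmy_cache;
  include /etc/nginx/proxy_cache_backend.conf;
  proxy_pass http://$backend_server;
}
```

Behavior should be the same as the three original blocks, just with one fewer block to maintain.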