We apologize for a period of extreme slowness today. The army of AI crawlers just leveled up and hit us very badly.

The good news: We're keeping up with the additional load of new users moving to Codeberg. Welcome aboard, we're happy to have you here. After adjusting the AI crawler protections, performance significantly improved again.

in reply to Codeberg

It seems like the AI crawlers learned how to solve the Anubis challenges. Anubis is a tool hosted on our infrastructure that requires browsers to do some heavy computation before accessing Codeberg again. It has saved us tons of nerves over the past months, because it spared us from manually maintaining blocklists by giving us a working way to tell "real browsers" from "AI crawlers".
in reply to Codeberg

However, we can confirm that at least Huawei networks now send the challenge responses, and they do seem to take a few seconds to compute the answers. It looks plausible, so we assume the AI crawlers leveled up their computing power and now emulate enough real-browser behaviour to bypass the variety of challenges that platforms have deployed to fend off the bot army.

reshared this

in reply to Codeberg

We have a list of explicitly blocked IP ranges. However, a configuration oversight on our part applied these blocks only to the "normal" routes; the "anubis-protected" routes didn't consult the blocklist. That was not a problem as long as Anubis itself still kept the crawlers out on those routes.

However, now that they managed to break through Anubis, there was nothing stopping these armies.

It took us a while to identify and fix the config issue, but we're safe again (for now).
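The class of mistake can be sketched roughly like this (a hypothetical nginx config; the paths, filenames, and upstream names are invented for illustration, and Codeberg's actual stack may differ):

```nginx
# blocked-ranges.conf contains lines like: deny 203.0.113.0/24;

location / {
    # blocklist applied on the "normal" routes...
    include /etc/nginx/blocked-ranges.conf;
    proxy_pass http://forgejo;
}

location /challenge/ {
    # ...but the challenge-protected route group was missing the
    # include, so blocked ranges could still reach it.
    # The fix: the same deny list belongs here too.
    include /etc/nginx/blocked-ranges.conf;
    proxy_pass http://anubis;
}
```

The general lesson: access-control lists have to be applied on every route group, not just the one they were first written for.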

reshared this

in reply to Codeberg


For the load average auction, we offer these numbers from one of our physical servers. Who can offer more?

(It was not the "wildest" moment, but the only one for which we have a screenshot.)

reshared this

in reply to Codeberg

>now that they managed to break through Anubis
There was no break - it's a simple matter of changing the User-Agent, or, if for some reason there's still a challenge, simply utilizing the plentiful computing power available on their servers (which far outstrips the processing power mobile devices have).

Anubis is evil and is proprietary malware - please do not attack your users with proprietary malware.

If you want to stop scraper bots, start serving GNUzip bombs - you can't scrape when your server RAM is full.

dd if=/dev/zero bs=1G count=10 | gzip > /tmp/10GiB.gz
dd if=/dev/zero bs=1G count=100 | gzip > /tmp/100GiB.gz
dd if=/dev/zero bs=1G count=1024 | gzip > /tmp/1TiB.gz

# nginx: serve gzip bombs
location ~* /bombs-path/.*\.gz {
    add_header Content-Encoding "gzip";
    default_type "text/html";
}

# serve zstd bombs
location ~* /bombs-path/.*\.zst {
    add_header Content-Encoding "zstd";
    default_type "text/html";
}

Then it's a matter of bait links that the user won't see, but bots will.

SuperDicq reshared this.

in reply to GNU/翠星石

@Suiseiseki Anubis is the option that saved us a lot of work over the past months. We are not happy about it being open core or using GitHub sponsors, but we acknowledge the position from the maintainer: codeberg.org/forgejo/discussio…

Calling our usage of Anubis an attack on our users is far-fetched. But feel free to move elsewhere, or to host an alternative without resorting to extreme measures. We're happy to see working proof that any other protection can be scaled up to the level of Codeberg. ~f

in reply to Codeberg

@Suiseiseki BTW, we're also actively following the work around iocaine, e.g. come-from.mad-scientist.club/@…

However, as far as we can see, it does not sufficiently protect from crawling. As the bot armies have successfully spread over many servers and addresses, damaging one of them doesn't prevent the next one from making harmful requests, unfortunately. ~f

in reply to Codeberg

A lot of users cannot pass Anubis challenges because Anubis does not support every browser and is also incompatible with popular security-focused browser extensions such as JShelter.

Asking your users to enable JavaScript and to disable security extensions like JShelter in order to visit your website is very bad, don't you agree?

I don't think it is far-fetched to call it an attack on your users at all.

in reply to Codeberg

I have a follow up question, though, @Codeberg, re: @zacchiro's question. Is it *possible* that giant human farms of Anubis challenge-solvers actually did it? Or did it all happen so fast that there is no way it could be that?

#Huawei surely could fund such a farm and the routing software needed to get the challenge to the human and back to the bot quickly enough that it might *seem* the bot did it.

in reply to Bradley M. Kühn

@bkuhn
Anubis challenges are not solved by humans. It's not like a captcha. It's a challenge that the browser computes, based on the assumption that crawlers don't run real browsers for performance reasons and implement only simpler clients.

So at least one crawler now seems to emulate enough browser behaviour to pass the Anubis challenge. ~f
@zacchiro
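The general shape of such a proof-of-work scheme can be sketched like this (a generic illustration in Python, not Anubis's actual algorithm or parameters):

```python
import hashlib
import os

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: find a nonce whose hash has `difficulty` leading
    zero hex digits. Expensive - this is what the browser computes."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + str(nonce).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: bytes, difficulty: int, nonce: int) -> bool:
    """Server side: a single hash, cheap to check."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

challenge = os.urandom(16)
nonce = solve(challenge, 4)  # roughly 65k hashes on average at 4 hex digits
print(verify(challenge, 4, nonce))  # True
```

The asymmetry is the point: solving costs the client real CPU time while verifying costs the server almost nothing - which is also why a crawler with plenty of computing power can simply pay the cost.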

in reply to Woozle Hypertwin

(on further thought) ...or is it?

  • Create a set of N problems.
  • Solve a sampling of them.
  • Require the bot to solve all of them.
  • If the bot's solutions to the solved set don't match, then it fails the whole test.

Might that work? I guess there could be problems with trustability of the "unknown" answers -- does that look like the main issue to be solved?
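As a toy sketch of that sampling idea (assuming the problems are deterministic hash puzzles with a unique lowest-nonce answer, which is what makes the spot-check comparable):

```python
import hashlib
import os
import random

def lowest_nonce(challenge: bytes) -> int:
    """Deterministic answer: the smallest nonce whose hash starts with
    two zero hex digits, so answers can be compared exactly."""
    n = 0
    while not hashlib.sha256(challenge + str(n).encode()).hexdigest().startswith("00"):
        n += 1
    return n

# 1. Create a set of N problems.
problems = [os.urandom(8) for _ in range(20)]

# 2. Solve only a random sample of them ourselves.
sample = random.sample(range(len(problems)), 5)
known = {i: lowest_nonce(problems[i]) for i in sample}

# 3. Require the client to solve all of them.
client_answers = {i: lowest_nonce(p) for i, p in enumerate(problems)}

# 4. Fail the whole test if any sampled answer mismatches.
passed = all(client_answers[i] == known[i] for i in sample)
print(passed)  # True for this honest client
```

The "trustability" worry stands, though: the unsampled answers are never checked, so a client could guess on those - the sample size only sets the odds of getting caught.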

in reply to Woozle Hypertwin

@woozle
I tend to think that if I had "plenty of free time to fight them," I'd dynamically identify which ones were bots, and then honor their requests, but also keep feeding them harder and harder problems to solve, making their costs "go through the roof" quite quickly. And maybe even give them misleading garbage data.

But that would be a lot of work, of course.

And it would be risky, as one might occasionally wrongly identify an actual valid real user.

in reply to Codeberg

Pardon my ignorance, but couldn't they just be using a headless browser, which would still do everything a regular browser does? Just recently, ChatGPT beat Cloudflare's CAPTCHA using a similar system. Is there really any way around this at all? @Codeberg@social.anoxinon.de
in reply to Codeberg

this is now on the #anubis team’s radar: github.com/TecharoHQ/anubis/is…
in reply to Codeberg

Is it possible to configure Anubis to go super hard on certain IP ranges? It's not only AI crawlers: HuaweiCloud also engaged in bulk-copying code repos from GitHub under the name of GitCode, and they even created fake accounts not owned by the original authors. Could it be they started doing this to Codeberg, too?
bytefish.medium.com/gitcode-is…
[Chinese] cnblogs.com/gt-it/p/18271287
GitCode is a code-hosting platform, a joint venture of HuaweiCloud and CSDN (nominally a blog service, basically a content farm now). [Chinese] qbitai.com/2023/09/85598.html
in reply to Codeberg

eBPF could be more effective and easier on the CPU, since it acts on a much lower network layer. Anubis kinda has its limits and is way too easy to circumvent (as you found out).

Maybe it's worth considering eBPF (if that hasn't already happened).

And thanks, guys, for your work. I'm a proud supporter and I'll continue to support you. Companies shouldn't control the open source space.

in reply to Codeberg

Anubis is extremely easy to bypass: you just have to change the User-Agent to not contain "Mozilla". Please get proper bot protection.

ulveon.net/p/2025-08-09-vangua…
This post talks briefly about other alternatives. Try Berghain, Balooproxy, or go-away.

in reply to Codeberg

Have you looked into serving these LLM crawlers alternative versions of the site, with poisoned data? (And rate-limiting, of course.) I know it would be additional work for you to implement this, but... it might be effective.

I'm thinking you could have a precomputed set of 1000 different poison repos that get served up randomly, each of which is a Markov-chain-scrambled version of the files in a real repo.

(I wrote codeberg.org/timmc/marko to do something similar to the contents of my blog posts—a Markov model on either characters or words.)
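A word-level Markov scrambler of the kind described can be sketched like this (a generic illustration, not the marko tool itself):

```python
import random
from collections import defaultdict

def build_model(text, order=1):
    """Map each `order`-word prefix to the words observed after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def scramble(model, length=30, seed=None):
    """Walk the chain to emit plausible-looking but scrambled text."""
    rng = random.Random(seed)
    order = len(next(iter(model)))
    out = list(rng.choice(list(model)))
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

source = "the quick brown fox jumps over the lazy dog and the fox naps"
model = build_model(source)
print(scramble(model, length=10, seed=1))
```

Each poisoned repo would then be generated from a real one with a different seed, so crawlers never see the same garbage twice.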

in reply to Codeberg

😲🤬 re: what's happened to @Codeberg today.
The AI ballyhoo *is* a real DDoS against one of the few code hosting sites that takes a stand against slurping #FOSS code into LLM training sets — in violation of #copyleft.

Deregulation/lack-of-regulation will bring more of this. ∃ plenty of blame to go around, but #Microsoft & #GitHub deserve the bulk of it; they trailblazed the idea that FOSS code-hosting sites are lucrative targets.

giveupgithub.org

#GiveUpGitHub #FreeSoftware #OpenSource

in reply to serk

IMO, @serk, the better move is not to delete the repository, but to do something like I've done here with my personal “small hacks” repository:

github.com/bkuhn/small-hacks

I'm going to try to make a short video of how to do this, step by step. The main thing is that rather than 404'ing, the repository now spreads the message that we should #GiveUpGitHub!

in reply to Codeberg

Are you guys using traffic shaping and queue management at all? For example, putting something like a QFQ qdisc on your routers, then marking packets from spammy sources and putting them into a low-priority queue, can be a huge boost in responsiveness for your real customers.
Spammy sources could be those that open new connections too often, transfer too many bytes, or have too many open active connections. All of those things can be accounted for in nftables.
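The idea might look roughly like this (illustrative commands only: the interface name, thresholds, and mark values are invented, and a real setup would use QFQ with per-class weights rather than this simpler prio qdisc):

```shell
# nftables: mark packets from sources holding too many concurrent
# connections (ct count is nftables' connlimit support)
nft add table inet shaping
nft add chain inet shaping pre '{ type filter hook prerouting priority mangle ; }'
nft add rule inet shaping pre ct count over 32 meta mark set 0x1

# tc: 3-band prio qdisc on the uplink; the fw filter steers marked
# packets into the lowest-priority band
tc qdisc add dev eth0 root handle 1: prio bands 3
tc filter add dev eth0 parent 1: protocol ip handle 0x1 fw classid 1:3
```

Real clients keep their normal latency, while marked flows only get leftover bandwidth.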
in reply to bkil

@bkil
Yuck. That does suck. But anything that makes a client stand out can be used to change that client's network priority - in this case, perhaps to boost the priority of packets from clients who have initiated more than one connection in the last hour. Those in the questionable group get sent to underpowered servers that return web pages saying "click xyz to continue to your requested page"; once you've identified a likely real client, they get elevated packet priority and
@Codeberg
in reply to Daniel Lakeland

@bkil
get shuttled to a different backend server. The idea being to make the experience good for clients that act normally, and low-availability for clients that only connect once a week or once a day etc. Plus, the questionable client has to run OCR and an LLM on your click page to figure out how to get past it. Easy for a human, expensive for a bot.

It's obviously whack-a-mole. But if latency is 500-1000 ms for bullshit clients and 50 ms for your real clients, then this is what you want.
@Codeberg