

We apologize for a period of extreme slowness today. The army of AI crawlers just leveled up and hit us very badly.

The good news: We're keeping up with the additional load of new users moving to Codeberg. Welcome aboard, we're happy to have you here. After adjusting the AI crawler protections, performance significantly improved again.

in reply to Codeberg

It seems like the AI crawlers learned how to solve the Anubis challenges. Anubis is a tool hosted on our infrastructure that requires browsers to do some heavy computation before accessing Codeberg again. It has saved us tons of nerves over the past months, because it spared us from manually maintaining blocklists by providing a working way to tell "real browsers" from "AI crawlers".
in reply to Codeberg

However, we can confirm that at least Huawei networks now send the challenge responses, and they do seem to take a few seconds to compute the answers. It looks plausible, so we assume the AI crawlers leveled up their computing power to emulate more real browser behaviour and bypass the diverse challenges that platforms have enabled to keep the bot army out.
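The challenge/response mechanism being discussed can be sketched as a simple proof-of-work loop. This is a simplified illustration, not Anubis's actual protocol; the function names and difficulty value are made up:

```python
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    # Client side: brute-force a nonce until the hash has `difficulty`
    # leading zero hex digits -- this is the expensive part.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    # Server side: a single hash, so checking is cheap.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("example-challenge", 4)
assert verify("example-challenge", nonce, 4)
```

The asymmetry is the point: the client burns many hash attempts while the server verifies with a single hash, which is why a crawler fleet willing to pay that cost can still get through.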


in reply to Codeberg

We have a list of explicitly blocked IP ranges. However, a configuration oversight on our part only blocked these ranges on the "normal" routes; the "anubis-protected" routes didn't consult the blocklist. This was not a problem as long as Anubis itself also kept the crawlers away from those routes.

However, now that they managed to break through Anubis, there was nothing stopping these armies.

It took us a while to identify and fix the config issue, but we're safe again (for now).


in reply to Codeberg

For the load average auction, we offer these numbers from one of our physical servers. Who can offer more?

(It was not the "wildest" moment, but the only one for which we have a screenshot)


in reply to Codeberg

In the days of single CPU servers (early 90s?) and an interesting filesystem problem, I think I may have seen ~400 at a client site!
in reply to Codeberg

ouch. This remains a cat-and-mouse game.

At least having them solve the Anubis challenge does cost them extra resources, but if they can do that at scale, it doesn't promise a lot of good.

in reply to Codeberg

wow - that looks scary. Thanks for all your work ❤️
in reply to Codeberg

I'm really sorry there isn't a good legal avenue to stave off the abuse. Horrifying.
in reply to Codeberg

I really wish you contacted me at all about this before going public.
in reply to Xe

@cadey I'm sorry if this gave you any unwanted or negative attention. I consider crawlers emulating more of real browser features to bypass protections of websites an inevitable future, and today at least one big crawler seems to have started doing so. ~f
@Xe
in reply to Codeberg

Can we continue this conversation over email after my panic subsides? me@xeiaso.net.
in reply to sam

@thesamesam Unfortunately, I'm not sure if encouraging anyone to reinforce the vendor lock-in of Microsoft GitHub by making maintainers financially dependent on that platform is in the spirit of our mission. ~f
@sam
in reply to Codeberg

yeowsa. this feels like an arms race that is going to get harder :(
in reply to Codeberg

This is a great number, but I have seen higher in my career. Unfortunately, I either have no screenshots or have lost the ones I had.

5831.24 is pretty good though. Congrats on hitting it; hope your head doesn't hurt. :D

in reply to Codeberg

damn. The only time I've seen numbers like this were when a ceph server went down.
in reply to Codeberg

what is the threshold for alerting? Grafana/Zabbix/Prometheus?
in reply to Codeberg

huh, that's a pretty kernel-heavy workload, so much red
in reply to Codeberg

thank you for the details. Very interesting. They are worth a blog post.
in reply to Codeberg

what if you had challenges for AI to perform that made it mine bitcoin for you and you just block them at the end anyway 🤣
in reply to Codeberg

Why not just to block huawei cloud asn prefixes?
It's easy to get them (e.g. from projectdiscovery)
in reply to Lenny

@lenny If you read the thread, you'll notice that this is exactly what we did, except that we made a mistake. ~f
in reply to Codeberg

>now that they managed to break through Anubis
There was no break: it's a simple matter of changing the user agent, or, if for some reason there's still a challenge, simply utilizing the plentiful computing power available on their servers (which far outstrips the processing power of mobile devices).

Anubis is evil and is proprietary malware - please do not attack your users with proprietary malware.

If you want to stop scraper bots, start serving GNUzip bombs - you can't scrape when your server RAM is full.

dd if=/dev/zero bs=1G count=10 | gzip > /tmp/10GiB.gz
dd if=/dev/zero bs=1G count=100 | gzip > /tmp/100GiB.gz
dd if=/dev/zero bs=1G count=1025 | gzip > /tmp/1TiB.gz

# nginx: serve gzip bombs
location ~* /bombs-path/.*\.gz {
    add_header Content-Encoding "gzip";
    default_type "text/html";
}

# nginx: serve zstd bombs
location ~* /bombs-path/.*\.zst {
    add_header Content-Encoding "zstd";
    default_type "text/html";
}

Then it's a matter of bait links that the user won't see, but bots will.


in reply to GNU/翠星石

@Suiseiseki Anubis is the option that saved us a lot of work over the past months. We are not happy about it being open core or using GitHub sponsors, but we acknowledge the position from the maintainer: codeberg.org/forgejo/discussio…

Calling our usage of anubis an attack on our users is far-fetched. But feel free to move elsewhere, or host an alternative without resorting to extreme measures. We're happy to see working proof that any other protection can be scaled up to the level of Codeberg. ~f

in reply to Codeberg

@Suiseiseki BTW, we're also actively following the work around iocaine, e.g. come-from.mad-scientist.club/@…

However, as far as we can see, it does not sufficiently protect from crawling. As the bot armies successfully spread over many servers and addresses, damaging one of them doesn't prevent the next one from doing harmful requests, unfortunately. ~f

in reply to Codeberg

A lot of users can not pass Anubis challenges because Anubis does not support every browser and is also incompatible with popular security focussed browser extensions such as JShelter.

Asking your users to enable JavaScript and to disable security extensions like JShelter in order to visit your website is very bad, don't you agree?

I don't think it is far-fetched to call it an attack on your users at all.

in reply to SuperDicq

I don’t think that because some people browse the internet with JavaScript off, you should just open your server up to being DDoSed over and over by aggressive scrapers. Maybe there should be a happy medium, but this is now the world we live in, thanks to grifters like Sam Altman.
in reply to cmdr ░ nova ⸸ :~$ 🏳️‍⚧️

Saying Anubis is the only solution to the scraper problem is a false dilemma. There are many other methods of stopping scrapers.

This is extremely bad for accessibility, and I consider it exclusionary for the many people who want to contribute to free software but now can't.

in reply to Codeberg

>can be scaled up to the level of Codeberg
He says, on the federated network.

1) Point a /botsfuckoff/ path at a script that randomly generates 200 links to itself whenever it's accessed
2) Deny it in robots.txt
3) Put a hidden link to it at the top of the home page
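The trap described above can be sketched in a few lines. This is an illustrative toy (the path name and link count are taken from the suggestion), meant to be wired into whatever server handles the route:

```python
import random
import string

def trap_page(n_links: int = 200, prefix: str = "/botsfuckoff/") -> str:
    # Every link points back into the trap path, so a crawler that
    # ignores robots.txt keeps requesting freshly generated pages.
    links = []
    for _ in range(n_links):
        slug = "".join(random.choices(string.ascii_lowercase, k=12))
        links.append(f'<a href="{prefix}{slug}">{slug}</a>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>"

page = trap_page()
```

A real deployment would rate-limit the trap itself, since serving it also costs you bandwidth.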

in reply to Codeberg

I believe @Suiseiseki is not referring to Codeberg's usage of Anubis specifically, but rather shares the FSF's stance (which I don't share) that Anubis "acts like malware" for making "calculations that a user does not want done": fsf.org/blogs/sysadmin/our-sma…

fsf saying fsf things :)

in reply to Codeberg

@Suiseiseki@freesoftwareextremist.com “We are not happy about it being open core … GH sponsors”

Do you have better suggestions for how we can have a sustainable OSS model that isn’t entirely dependent on core contributors of major projects having full time jobs and then supporting everyone else in whatever free time they might have?

in reply to Codeberg

so, to clarify, do you have evidence that the bots were solving Anubis challenges or not, i.e., it was due to the configuration issue? (I think it's inevitably going to happen if Anubis gets traction. I'm just curious if we're already there or not.) Thanks for your work and transparency on all this.
in reply to Stefano Zacchiroli

@zacchiro Yes, the crawlers completed the challenges. We tried to verify if they are sharing the same cookie value across machines, but that doesn't seem to be the case.


in reply to Codeberg

I have a follow up question, though, @Codeberg, re: @zacchiro's question. Is it *possible* that giant human farms of Anubis challenge-solvers actually did it? Or did it all happen so fast that there is no way it could be that?

#Huawei surely could fund such a farm and the routing software needed to get the challenge to the human and back to the bot quickly enough that it might *seem* the bot did it.

in reply to Bradley Kuhn

@bkuhn
Anubis challenges are not solved by humans. It's not like a captcha. It's a challenge that the browser computes, based on the assumption that crawlers don't run real browsers for performance reasons and only implement simpler crawlers.

So at least one crawler now seems to emulate enough browser behaviour to make it pass the anubis challenge. ~f
@zacchiro

in reply to Codeberg

I get it now.

Thanks for taking the time to clue me in.

I'm lucky that I haven't needed to learn about this until now and I'm so sorry you've had to do all this work to fight this LLM training DDoS!

Cc: @zacchiro

in reply to Codeberg

Is your list shared? It would be good to have a list of carefully curated AI-bot block lists.
in reply to Codeberg

I like the idea of them figuring out solving the Anubis challenge only to be blocked afterward
in reply to Henrý Ólson

@nemo Currently not. We wanted to investigate the legal situation with regard to sharing such lists. They could currently contain individuals' IP addresses and likely need to be cleaned up first. ~f
in reply to Codeberg

Was the solution to increase the proof-of-work difficulty?
in reply to Steven Sandoval

@baltakatei No. We fixed our config. Now we're blocking the offending IP ranges directly. ~f
in reply to Codeberg

have you tried filing a criminal complaint against the "attacker"? Basically it's a breach of ToS and a DoS, right? So it might qualify as a violation of § 303b StGB (German criminal code). I mean, I am no lawyer, but at least it's worth a try?
in reply to Codeberg

How much were they slowed down by actually solving the challenges? I was under the impression that the proof of work was the primary intent of Anubis, and the fact that most crawlers just bombed out and didn't even attempt them in the first place was a bonus.
in reply to Codeberg

It makes me wonder: is there a public curated IP blocklist somewhere that we can all use? I searched a bit and found only weak robots.txt solutions based on User-Agent.
in reply to Codeberg

Seems a bad cat-and-mouse game; glad you could stay on top of it (proves that humans can still win). Jesus Christ, those big tech companies should be held responsible for that shit and pay billions in fines. Maybe then they would think about stopping that insanity.
in reply to Codeberg

Good luck with fighting the bots. I recently moved my OSDev project and site to Codeberg from GitHub and so far it’s been great!

Thank you for helping the open-source community!

in reply to Codeberg

Now what needs to happen is that part of the challenge computes a known answer while the other part does useful computational work, and there's no way for the 'bot to tell which is which -- so it has to do both.

That could maybe contribute computing power to something important like Folding@Home, or even just something pretty like Electric Sheep.

in reply to Woozle Hypertwin

@woozle This topic was discussed in the past. The problem is that cutting useful work into small chunks AND verifying it is very difficult. It might work for some cryptocurrencies, but that's not something we're interested in.

A proof of concept is more than welcome, but I don't yet know if anyone found a suitable task for this.

~f

in reply to Woozle Hypertwin

(on further thought) ...or is it?

  • Create a set of N problems.
  • Solve a sampling of them.
  • Require the bot to solve all of them.
  • If the bot's solutions to the solved set don't match, then it fails the whole test.

Might that work? I guess there could be problems with trustability of the "unknown" answers -- does that look like the main issue to be solved?
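One way to picture the spot-checking scheme sketched above, with toy multiplication problems standing in for real work (all names here are illustrative; the hard part remains chunking and verifying genuinely useful work):

```python
import random

def make_challenge(n_problems: int = 10, n_known: int = 3):
    # Create N problems, but precompute answers only for a random sample.
    problems = [(random.randint(1, 999), random.randint(1, 999))
                for _ in range(n_problems)]
    spot_checked = random.sample(range(n_problems), n_known)
    known = {i: problems[i][0] * problems[i][1] for i in spot_checked}
    return problems, known

def check(client_answers: list[int], known: dict[int, int]) -> bool:
    # The client must answer everything; we verify only the sampled subset.
    # A cheater who guesses risks failing one of the hidden spot checks.
    return all(client_answers[i] == a for i, a in known.items())

problems, known = make_challenge()
honest = [a * b for a, b in problems]
assert check(honest, known)
```

As the follow-up reply notes, the catch is latency for real users: the more problems you hand out, the simpler each must be.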

in reply to Woozle Hypertwin

@woozle Remember that users want to get through the challenge page quickly. So the more samples you have, the simpler the individual problems need to be.

~f

in reply to Codeberg

OH RIGHT... I was kinda forgetting about the actual-user time-penalty.
in reply to Woozle Hypertwin

@woozle
I tend to think that if I had "plenty of free time to fight them," I'd dynamically identify which ones were bots, and then honor their requests, but also keep feeding them harder and harder problems to solve, making their costs "go through the roof" quite quickly. And maybe even give them misleading garbage data.

But that would be a lot of work, of course.

And it would be risky, as one might occasionally wrongly identify an actual valid real user.

in reply to Codeberg

Pardon my ignorance, but couldn't they just be using a headless browser, which would still do everything a regular browser does? Just recently, ChatGPT beat Cloudflare's CAPTCHA using a similar system. Is there really any way around this at all? @Codeberg@social.anoxinon.de
in reply to Codeberg

If some of the attack is coming from Huawei's cloud hosting, it might be worth sending a complaint to their abuse department. IME Chinese companies tend to be scared of breaking rules in international dealings like this.
in reply to Codeberg

These companies are evidently willing to pay an absolutely staggering cost to do their scraping.

I wonder, are they paying with their own money, or are they “borrowing” some unsuspecting strangers' compromised computers/routers/etc to do the work?

in reply to Codeberg

this is now on the #anubis team’s radar: github.com/TecharoHQ/anubis/is…
in reply to Codeberg

Is it possible to configure Anubis to go super hard for certain IP ranges? It's not only AI crawlers: HuaweiCloud also engaged in bulk copying code repos from GitHub under the name of GitCode, and they even created fake accounts not owned by the original authors. Could it be they started doing this to Codeberg, too?
bytefish.medium.com/gitcode-is…
[Chinese] cnblogs.com/gt-it/p/18271287
GitCode is a code hosting platform, a joint venture of HuaweiCloud and CSDN (used to be a blog service, basically a content farm now).
[Chinese] qbitai.com/2023/09/85598.html
in reply to Codeberg

I observed them too about a month ago. I then sent the whole AS to Google's recaptcha and it worked (at least people who can solve recaptcha can still access our site while these bots can't).
in reply to Codeberg

boy Huawei is so nasty

I wonder who are the biggest offenders on this matter...

in reply to Codeberg

"AI crawlers learned how to solve the Anubis challenges"

Why does EU discuss chat control and not AI crawlers control again?

in reply to Codeberg

eBPF could be more effective and easier on the CPU, since it acts on a much lower network layer. Anubis kinda has its limits and is way too easy to circumvent (as you found out).

Maybe it's worth considering eBPF (if that hasn't already happened).

And thanks guys for your work. I'm a proud supporter and I'll continue to support your work. Companies shouldn't control the Open Source space

in reply to Codeberg

hey, maybe mCaptcha is something for you. They adjust their algorithm based on system power and lean more on memory (which is way less abundant in AI server farms).
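For illustration, the memory-hard idea can be sketched with `hashlib.scrypt` from Python's standard library. This is not mCaptcha's actual algorithm, and the parameters are made up:

```python
import hashlib

def memory_hard_tag(challenge: bytes, nonce: int) -> bytes:
    # scrypt with n=2**14, r=8 touches roughly 16 MiB of RAM per call:
    # trivial for one browser tab, costly across a whole crawler fleet.
    return hashlib.scrypt(
        challenge + nonce.to_bytes(8, "big"),
        salt=b"demo-salt",
        n=2**14, r=8, p=1,
        maxmem=64 * 1024 * 1024,
    )

tag = memory_hard_tag(b"example-challenge", 1)
```

The design choice is to shift the bottleneck from CPU (which scraper farms have in abundance) to RAM per concurrent request.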
in reply to Codeberg

It's going to be a rat race after all; I expected this to happen eventually. Surprising it took this long.
in reply to Codeberg

Anubis is extremely easy to bypass, you just have to change the User-Agent to not contain Mozilla, please get proper bot protection.

ulveon.net/p/2025-08-09-vangua…
This post talks briefly about other alternatives. Try Berghain, Balooproxy, or go-away.

in reply to ulveon.net

@ulveon This depends on the configuration, and it was not the problem we have been running into today. ~f
in reply to Codeberg

Perhaps it's time to stop letting robots solve puzzles and instead feed them bombs. Do we know how well a ZIP bomb works on these crawlers?
in reply to Codeberg

Have you looked into serving these LLM crawlers alternative versions of the site, with poisoned data? (And rate-limiting, of course.) I know it would be additional work for you to implement this, but... it might be effective.

I'm thinking you could have a precomputed set of 1000 different poison repos that get served up randomly, each of which is a Markov-chain-scrambled version of the files in a real repo.

(I wrote codeberg.org/timmc/marko to do something similar to the contents of my blog posts—a Markov model on either characters or words.)
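A toy version of the Markov scrambling idea described above (not the actual marko tool) might look like this:

```python
import random
from collections import defaultdict

def markov_scramble(text: str, seed: int = 0) -> str:
    # First-order word-level Markov model: each word is followed by a
    # random successor observed in the original text, producing output
    # that is locally plausible but globally garbage.
    rng = random.Random(seed)
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    out = [words[0]]
    for _ in range(len(words) - 1):
        successors = chain.get(out[-1])
        out.append(rng.choice(successors) if successors else rng.choice(words))
    return " ".join(out)

scrambled = markov_scramble("the cat sat on the mat and the cat ran off")
```

Served only to identified crawlers, such output poisons the training data while costing almost nothing to generate.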

in reply to Codeberg

😲🤬 re: what's happened to @Codeberg today.
The AI ballyhoo *is* a real DDoS against one of the few code hosting sites that takes a stand against slurping #FOSS code into LLM training sets — in violation of #copyleft.

Deregulation/lack-of-regulation will bring more of this. ∃ plenty of blame to go around, but #Microsoft & #GitHub deserve the bulk of it; they trailblazed the idea that FOSS code-hosting sites are lucrative targets.

giveupgithub.org

#GiveUpGitHub #FreeSoftware #OpenSource

in reply to Bradley Kuhn

@bkuhn if anyone needs it, there is this gist showing how to pseudo-automate repository bulk deletion.
gist.github.com/mrkpatchaa/637…

and this tool
reporemover.xyz very handy

in reply to serk

IMO, @serk, the better move is not to delete the repository, but to do something like I've done here with my personal “small hacks” repository:

github.com/bkuhn/small-hacks

I'm going to try to make a short video of how to do this, step by step. The main thing is that rather than 404'ing, the repository now spreads the message that we should #GiveUpGitHub!

in reply to Bradley Kuhn

GitHub is still the only full-featured forge that can be viewed without JS (using a little custom userjs). Hilariously, this is caused by the PoW enacted by all alternatives.
in reply to Codeberg

Thank you for defending us the best you guys can! I also appreciate the transparency and honesty. All the more reason to stay for the foreseeable future. :blobfoxfloofhappy:
Codeberg, in reply to an unknown parent

@gturri Anubis sends a challenge. The browser needs to compute the answer with "heavy" work. The server then has "light" work and verifies the challenge.

As far as we can tell, the crawlers actually do the computation and send the correct response. ~f

in reply to Codeberg

could just set up a few traps that crash the AI crawlers or something. This is going to get really annoying, and hopefully these bastards don't interfere with some of my work in the long run with what they've been doing on the internet. Scraping is already largely frowned upon, so these POS are just making it worse.
in reply to Codeberg

what if the new captcha was get a bug fix PR merged? That'd keep them robits out.
in reply to Codeberg

It’s ok, take it slowly and take care of yourselves and your computer friends.
in reply to Codeberg

Thank You For Your Service. ( I moved to Codeberg, like, yesterday, and signed up a recurring donation )
in reply to Codeberg

Are you guys using traffic shaping and queue management at all? For example putting something like QFQ qdisc on your routers and then marking packets from spammy sources as low-priority and putting them into a low priority queue can be a huge boost in responsiveness for your real customers.
Spammy sources could be those that open new connections too often, transfer too many bytes, or keep too many connections active. All of those things can be accounted for in nftables.
in reply to Daniel Lakeland

If you followed their posts, they are coming through millions of nodes of residential proxy services, with a single IP usually not requesting more than one web page in the same week. You can't filter that at the network level.
in reply to bkil

@bkil
Yuck. That does suck. But anything that makes a client stand out can be used to change that client's network priority. In this case, perhaps boost the priority of packets from clients who have initiated more than one connection in the last hour. For those in the questionable group, send them to underpowered servers that return web pages that say "click xyz to continue to your requested page"; once you've identified a likely real client, they get elevated packet priority and
@Codeberg
in reply to Daniel Lakeland

@bkil
are shuttled to a different back-end server. The idea is to make the experience good for clients that act normal and low-availability for clients that only connect once a week, once a day, etc. Plus, make the questionable client run OCR and an LLM on your click page to figure out how to get past it. Easy for a human, expensive for a bot.

It's obviously whack-a-mole. But if latency is 500-1000 ms for bullshit clients and 50 ms for your real clients, then this is what you want.
@Codeberg

in reply to Daniel Lakeland

@bkil
It means keeping an nftables IP set of millions of real customers, but running a Linux server with gigabytes of RAM as an edge router would make this fine. Even a billion real customers would need only 16 GB of RAM for IPv6, which is doable.
@Codeberg
in reply to Codeberg

can you identify the owners? I wonder if they are famous companies or someone else (not asking for names, just wondering).
in reply to Codeberg

I hope you are feeding the crawlers garbled but plausible code examples, so that they produce garbage and keep fellow software developers in their jobs! Or even create the new job title of "AI garbage disposal officer".
in reply to Codeberg

I've been moving my stuff to Codeberg. Glad to see you have a presence on Mastodon! Thanks for being there.
in reply to Codeberg

Really need to sue them for a denial-of-service attack and get them banned from touching a computer for 20 years.
in reply to Codeberg

Are you going to publish your work anywhere? It could cause the bot spike again, but more Forgejo instances will likely be hit with this soon, so it would be good to establish some way to communicate this among other Forgejo instances to prevent abuse.