Proposed Amendment to the Telephone Consumer Protection Act of 1991
Whereas, recognizing the changes in technology since the passage of the Telephone Consumer Protection Act of 1991 (47 U.S.C. § 227);
Whereas the original Act established the principle that the sender of unsolicited communications shall bear the cost of transmission, not the receiver, and whereas Congress found the shifting of paper, toner, and machine costs to the recipient of unsolicited facsimile transmissions sufficiently objectionable to establish statutory damages per violation;
Whereas advances in automated data collection systems have created an analogous and substantially greater undue burden on the operators of publicly accessible computer servers, consuming bandwidth, compute resources, and volunteer labor without consent or compensation;
Whereas the voluntary compliance mechanism known as "robots.txt" has proven to be a gentleman's agreement, and the gentlemen have apparently left the building;
The following sections shall be appended to the Act:
Section 227(g)(1). Cost of Receipt. Any automated system that accesses a publicly accessible computer server for the purpose of data collection or machine learning model training shall compensate the server operator for the bandwidth and compute resources consumed. The sender pays, not the receiver.
Section 227(g)(2). Opt-In, Not Opt-Out. The content of a publicly accessible computer server is not in the public domain by virtue of its accessibility. Default posture shall be no automated training access. Server operators who wish to permit such access may indicate so via a standardized machine-readable mechanism to be established by the Commission.
Section 227(g)(3). Abuse of Anonymity for Commercial Data Collection. Nothing in this section shall be construed to require identification for ordinary browsing or access to publicly available information. However, the use of anonymity infrastructure (including but not limited to distribution of requests across large numbers of IP addresses to simulate independent human visitors, spoofing of user agent strings, simulation of human browsing behavior, and registration of accounts) for the purpose of commercial-scale data collection while circumventing access controls shall constitute a separate violation per request.
Section 227(g)(4). Enforcement. Statutory damages of not less than $500 per server request made in violation of this section, consistent with the per-violation damages established under the original Act for unsolicited facsimile transmissions.
Section 227(g)(5). Reporting. Violations shall be reported by printing the relevant server access logs and faxing them to the headquarters of the offending entity. In the spirit of the original Act, all complaints must be submitted via facsimile transmission. Email will not be accepted.
Section 227(g)(6). Scope. This section applies to any entity operating automated data collection for the purpose of training machine learning models, regardless of the entity's stated research purpose, nonprofit status, or market capitalization.
Why
I'm on a team that runs a small community site called swgemu.com. We're a group of volunteers trying to keep a dead MMO alive. Star Wars Galaxies was shut down by Sony in 2011, and a handful of people who loved the game decided to reverse-engineer the server and keep it running. We're donation-funded. Nobody gets paid. The community is small. A few hundred active people, with a handful of volunteers.
Our current record for "most users ever online" on our forums is 30,817. That was February 6th, 2026, at 4:51 AM.
Our average active members per day on our forums in 2026 is 67.
That's roughly 460 bots for every real person. And it's getting worse. In 2022 we averaged 529 active members per day. By 2026 that's down to 67, while the bot traffic has exploded in the opposite direction. Our real community is shrinking and the automated traffic is growing so fast it makes the real users invisible in the logs. To make it even more absurd, on December 15, 2024 we announced v1.0 and disabled new threads and replies on most of the forums, leaving only the active test server "Finalizer" open for players to post and discuss. The forums are mostly a frozen archive at this point. The bots are scraping content that isn't even changing.
What we have is an avalanche of bots hitting us from datacenters all across the world. Crawlers, scrapers, "AI researchers," and whatever else is out there hoovering up every piece of text on the internet to feed the next training run. Our little community forum, where people share tips about crafting lightsabers in a game that's been dead for fifteen years, is apparently valuable training data for someone.
I ran the numbers last week. Seven days of AWS load balancer logs. All real data, not estimates.
9.5 million requests. 600 gigabytes of bandwidth. 2.34 million distinct IP addresses. All of it hitting a forum for a game that's been dead for fifteen years, with 67 active members.
The bots we can name are not the problem. The bots that refuse to be nameable are.
The well-behaved crawlers (Googlebot, Amazonbot, ClaudeBot, GPTBot, Applebot, Bingbot, and a dozen others) identify themselves, pace themselves, and accept "no" for an answer. When our WAF blocks GPTBot, it politely keeps retrying at low rates and getting nowhere. They're knocking on a locked door and accepting the lock. All the named bots combined account for about 15% of our traffic. We can rate-limit them, robots.txt them, allow-list the ones we want, deny the ones we don't. Those tools work because the bot is willing to be addressed. They are not the problem. If anything, they prove that doing this responsibly is technically possible.
At our peak we had about 2,300 unique IPs playing the actual game on any given day. Last week 2.34 million unique IPs hit our forums. That's over a thousand scraper IPs for every player we had at our best, hitting a forum for a game that most of them have probably never heard of.
The problem is the other roughly three-quarters of our traffic: the requests that refuse to identify themselves at all.
Last week our load balancer served over 7 million requests from just 347 distinct browser fingerprints, each one shared by more than a thousand different IP addresses. Those requests came from 5.4 million separate IP appearances in seven days. Most IPs made one "next page" request and then disappeared forever. When you look at the sequence, the pattern is obvious. A single page load is being distributed across hundreds of IPs. 555 gigabytes of bandwidth in a week, served from a donation-funded volunteer site, to anonymous commercial crawlers we have no way to bill, contact, or refuse.
Take one single fake "Chrome on Mac" user-agent string from that seven-day sample: roughly 580,000 requests from over 430,000 distinct IPs. Median requests per IP: one. Each IP shows up once and vanishes. The average response is 64 KB (eight times larger than what the polite crawlers pull), because they're downloading whole threads and attachments. That's 35 gigabytes from one fake browser string. Now multiply by 347.
There's no User-Agent to block (it's a stock browser string shared with millions of real users). No IP to rate-limit (each IP makes one request and never returns). No robots.txt to honor (those mechanisms require an identity). The entire architecture is purpose-built to circumvent every consent and rate-limiting tool the open web has standardized on for the last 30 years.
Last week our scraper traffic came from more than 100 countries. Not because the world is fascinated by a Star Wars Galaxies emulator of a dead MMO. Because residential proxy networks have rented or compromised IPs in nearly every country on earth, and the people running these crawlers are paying for that camouflage on purpose. Two patterns sit side by side in our logs. A few thousand Netherlands datacenter IPs each made over 170 requests last week, wholesale scraping from EU hosting providers. Underneath that, over 760,000 US residential IPs each made an average of three requests and disappeared. Same operator, same operation. The datacenters do the heavy lifting and the residential pool provides the camouflage. A small residential connection in Brazil made two requests to our forum last week. So did one in Tunisia, one in Nepal, one in Iraq. None of those people knew. Whoever rents those IPs sold them by the gigabyte to a company that wanted to scrape our forum, and our forum got the bandwidth bill.
Conventional wisdom says block AWS, block Google Cloud, block Azure, block DigitalOcean. That's where the bots live. We did the math. Across 2.34 million distinct IP addresses that hit our forum from un-attributed scrapers in a single week, less than 1% came from those four cloud providers combined. Ninety-nine percent came from somewhere else. Residential ISP addresses rented from real households by "residential proxy" companies, and a long tail of smaller hosting providers in countries that don't answer abuse emails. You cannot solve this with a deny-list. The IPs are someone's grandmother's cable modem, sold to a scraping company by the gigabyte, by a middleman the grandmother has never heard of, to feed a model trained by a company she will never use. Every link in the chain is perfectly deniable. That is the design.
Our forum is busier at 3 AM than at 1 PM. A normal community forum has a deep overnight trough and a daytime peak (humans wake up, log in over coffee, post during lunch, drift off after dinner). Our traffic curve is essentially flat because the bots don't sleep, and the bots are now what the site is for.
One day in May our registration form was hit 15,000 times in 24 hours. Our typical real-signup rate is three or four a day. The WAF was so saturated that 1,728 requests timed out at the load balancer, and those timeouts included real humans trying to join the community. The volume only dropped after I manually added scraper IP ranges to the firewall. That is what scraper traffic looks like at our scale: it doesn't just slow you down, it actively breaks the door for the real users trying to walk through.
We've debated flattening the site to static HTML to make it easier to serve and cache, but that's more work for us. We could turn on more CDN support, but that's more money. Every option is either our time or our money spent solving a problem created by someone else's dream of becoming the next AI billionaire. Every hour we spend on this is an hour we're not spending on the thing we actually care about, which is keeping the game alive for the people who love it.
This is the fax machine argument all over again. Someone decided our content was worth taking, and they made us pay for the privilege of having it taken. The difference is scale. In 1991 you might get a few dozen junk faxes a day. In 2026 you get 600 gigabytes of scraper traffic in a week. And unlike a fax machine, a website doesn't run out of paper and stop. It just gets slower, and the real users give up, and one day there is no one left but the scrapers and the operator they're billing for the privilege.
I want to be clear about something. I am not against AI. I literally just wrote a book about working with human creativity and large language models (LLMs). I use them every single day and my results keep getting better. I write open source code and I fully expect it to get slurped up by the bots and into LLMs. GitHub and others are presumably paying that bill, and they have the people and resources to manage programmatic access or force the crawlers to pay for it. Anything I've written publicly, anything I've put out into the open, I accept that's the reality of sending things into the ether. Who knows where it lands.
But there's a difference between accepting that reality and watching a small, volunteer, donation-driven site get brutalized by it. They ignore our robots.txt. We spend real money on a web application firewall. We spend real money on scaling. We spend real time limiting traffic, trying to keep our service alive for our human users. And it's still a constant battle. It's salt in the wound. You're already pouring your time and your life into this project out of love for a dead game, and now you're pouring your money into it too, just to make someone else richer.
The sad thing is we want people to find our site. We want people to discover the scrapbook we host with all the details anyone would need to play on a Star Wars Galaxies server. We want to spread the news as we look to launch new servers and services for the community. We're not trying to hide. We just don't want our actual human visitors stuck behind a wall of CAPTCHAs and rate limiters we had to build because 30,000 bots showed up uninvited.
The companies sending those bots are not small operations scraping by on donations. They are some of the most valuable companies on earth. They are spending billions on GPU clusters and training runs and they cannot be bothered to ask permission or compensate the people whose content makes their products work. A company worth a trillion dollars is using the bandwidth of a donation-funded fan site to train a model they will sell for profit.
I know this will never pass. It's not even about lobbyists. It's a tough technical problem and I don't think politicians or regulators can quite understand what's happening. The TCPA was easy because senators could see fax spam piling up in their own offices. Nobody in Congress is watching their cloud bill go up because bots are scraping their community forum. They're not even paying for it, and most of them wouldn't know if they were. And even if something did pass, the regulators would most likely get it wrong anyway. I'm at a loss as to how I'd even report a violation. In the early days I would send dutiful emails to abuse@ addresses at the offending providers. Never got a single reply. No change in behavior. So I switched to blocking them and ended up with over 10,000 IP ranges in our block list, some as large as /16, and we're still getting hammered. I suppose I could always print out my logs and fax them.
But the TCPA didn't pass because the fax machine lobby was powerful. It passed because the argument was simple enough that a senator could understand it: someone else is making you pay for their advertising. The TCPA worked because the fax machine had to be physically wired to a phone line, and that phone line had to be paid for by someone identifiable.
Someone else is making you pay for their training data. And the scraper economy has spent the last five years engineering a substrate where no link in the chain is identifiable. The IP is anonymous. The operator is anonymous. The proxy reseller is anonymous. The buyer is anonymous. None of that is an accident. That's the product. I'm all for anonymous use of the internet. There are perfectly valid reasons not to force someone to show their driver's license to walk into a library and read the news. But there's a difference between a person browsing anonymously and a company routing millions of requests (billions, if you consider all the other sites they're hitting simultaneously) through residential proxies to commercially extract content while avoiding being told "no." That's not anonymity. That's abuse of anonymity at commercial scale, and it should be illegal. The argument hasn't changed. The fax machine just got a lot bigger.
Under this proposed framework, last week's traffic would create approximately $4.77 billion in statutory damages. To preserve your anonymity, payment in BTC is acceptable. Please advise when transfers will commence.
In the meantime, if you need me, I'll be reading server logs and blocking another datacenter that just registered an account named "ResearchBot2026" and downloaded our entire archive of lightsaber crafting guides at 4 AM. :)
The Hard Data
Seven days of AWS ALB access logs, 2026-05-03 through 2026-05-09 (7 complete days). All numbers from real Athena queries, not estimates. Query files committed for reproducibility.
Overall
| Metric | Value |
|---|---|
| Total requests | 9,540,534 |
| Total bandwidth | ~600 GB (~86 GB/day) |
| Distinct client IPs | 2,330,308 |
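The committed query files aren't reproduced here, but the shape is simple. Here's a minimal sketch of the query behind the overall numbers, assuming an Athena table named `alb_logs` created from the AWS-documented ALB access-log DDL; the column names (`time`, `client_ip`, `sent_bytes`) come from that DDL and may need adjusting for a different table definition.

```sql
-- Sketch only: assumes an "alb_logs" table built from the AWS-documented ALB log DDL.
SELECT
  count(*)                          AS total_requests,
  round(sum(sent_bytes) / 1e9, 1)   AS total_gb_sent,
  count(DISTINCT client_ip)         AS distinct_client_ips
FROM alb_logs
WHERE from_iso8601_timestamp(time) >= timestamp '2026-05-03 00:00:00 UTC'
  AND from_iso8601_timestamp(time) <  timestamp '2026-05-10 00:00:00 UTC';
```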
Traffic breakdown by category
| Category | Requests | % |
|---|---|---|
| Browser-shaped (no bot UA) | 7,909,034 | 82.9% |
| Self-identified bot | 1,473,448 | 15.4% |
| Health check (internal) | 121,325 | 1.3% |
| Self-identified tool (curl/wget/scanners) | 36,727 | 0.4% |
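The split above comes from classifying the user-agent string. A sketch of that classification against the same assumed `alb_logs` table; the bot and tool lists below are illustrative, not the exact ones behind the table.

```sql
-- Illustrative classification; the real bot/tool lists are longer.
SELECT
  CASE
    WHEN user_agent LIKE 'ELB-HealthChecker%'                                                    THEN 'health check'
    WHEN regexp_like(user_agent, '(?i)(googlebot|bingbot|gptbot|claudebot|amazonbot|applebot)')  THEN 'self-identified bot'
    WHEN regexp_like(user_agent, '(?i)(curl|wget|python-requests|go-http-client)')               THEN 'self-identified tool'
    ELSE 'browser-shaped'
  END                                                 AS category,
  count(*)                                            AS requests,
  round(100.0 * count(*) / sum(count(*)) OVER (), 1)  AS pct
FROM alb_logs
GROUP BY 1
ORDER BY requests DESC;
```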
Browser-shaped traffic by IP sharing (the anonymity abuse fingerprint)
89% of browser-shaped traffic (73.7% of all traffic) comes from just 347 browser fingerprints each shared across 1,000+ IPs. That bucket alone served 554.86 GB in 7 days.
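A sketch of how that bucket falls out of the logs, same assumed table. The real fingerprint is richer than the raw user-agent string, but the user-agent alone is enough to show the pattern: one string, thousands of IPs.

```sql
-- Browser-shaped user-agent strings shared across more than 1,000 distinct IPs.
SELECT
  user_agent,
  count(DISTINCT client_ip)        AS distinct_ips,
  count(*)                         AS requests,
  round(sum(sent_bytes) / 1e9, 2)  AS gb_sent
FROM alb_logs
WHERE NOT regexp_like(user_agent, '(?i)bot|crawl|spider|curl|wget')  -- keep browser-shaped traffic only
GROUP BY user_agent
HAVING count(DISTINCT client_ip) > 1000
ORDER BY requests DESC;
```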
One proxy-pool UA in detail (fake "Chrome on Mac")
| Metric | Value |
|---|---|
| Total requests | 580,841 |
| Distinct IPs | 430,661 |
| Median requests per IP | 1 (shows up once, disappears) |
| p99 requests per IP per hour | 2 |
| HTTP 2xx rate | 60.1% (getting real content) |
| Avg response size | 64 KB (8x larger than polite crawlers) |
| Total served | 35 GB |
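The drill-down on that one string is a per-IP aggregation. A sketch, same assumed table; the placeholder stands in for the full fake Chrome-on-Mac user-agent string, and the per-hour p99 in the table needs an extra hour bucket this sketch omits.

```sql
-- Per-IP request distribution for a single proxy-pool user-agent string.
WITH per_ip AS (
  SELECT client_ip,
         count(*)        AS reqs,
         sum(sent_bytes) AS bytes
  FROM alb_logs
  WHERE user_agent = '<full fake Chrome-on-Mac UA string>'  -- placeholder
  GROUP BY client_ip
)
SELECT
  sum(reqs)                                      AS total_requests,
  count(*)                                       AS distinct_ips,
  approx_percentile(reqs, 0.5)                   AS median_requests_per_ip,
  round(sum(bytes) * 1.0 / sum(reqs) / 1024, 0)  AS avg_response_kb,
  round(sum(bytes) / 1e9, 1)                     AS total_gb
FROM per_ip;
```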
Diurnal pattern (hourly avg requests, EST)
| Hour | Avg requests |
|---|---|
| 3 AM | 68,911 (busiest) |
| 9 AM | 67,081 |
| 12 PM | 53,090 |
| 1 PM | 47,955 (quietest) |
| 6 PM | 57,618 |
| 11 PM | 58,815 |
A real community forum has a deep overnight trough and a daytime peak. Our busiest hour is 3 AM.
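The hourly profile is a single group-by on the hour of day, same assumed table, shifted to Eastern time to match the table above.

```sql
-- Average requests per hour of day across the 7-day sample, in US Eastern time.
SELECT
  hour(from_iso8601_timestamp(time) AT TIME ZONE 'America/New_York')  AS hour_et,
  count(*) / 7                                                        AS avg_requests
FROM alb_logs
GROUP BY 1
ORDER BY 1;
```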
Registration-form assault (May 5, 2026)
| Metric | Value |
|---|---|
| Normal real signups per day | 3.7 |
| Registration hits on May 5 | 15,009 |
| WAF challenges (HTTP 460) | 8,004 |
| Load balancer timeouts (HTTP 504) | 1,728 |
| Ratio to normal | ~4,000x |
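A sketch of the May 5 breakdown, same assumed table. The '/register' path is a stand-in for the real registration URL.

```sql
-- Registration-endpoint hits on 2026-05-05, broken out by load balancer status code.
SELECT
  elb_status_code,
  count(*) AS hits
FROM alb_logs
WHERE request_url LIKE '%/register%'   -- placeholder path
  AND date(from_iso8601_timestamp(time)) = date '2026-05-05'
GROUP BY elb_status_code
ORDER BY hits DESC;
```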
Geographic distribution of abuser traffic (top 10 countries)
2,330,308 distinct IPs analyzed via MaxMind GeoLite2. Traffic confirmed from 100+ countries.
| Country | Requests | % | Distinct IPs |
|---|---|---|---|
| United States | 2,595,182 | 32.8% | 763,092 |
| Netherlands | 973,206 | 12.3% | 5,807 (datacenter signature) |
| United Kingdom | 616,997 | 7.8% | 129,795 |
| Brazil | 554,862 | 7.0% | 214,500 |
| Iraq | 238,869 | 3.0% | 42,201 |
| China | 202,369 | 2.6% | 3,965 (datacenter signature) |
| Canada | 192,095 | 2.4% | 65,974 |
| Chile | 108,774 | 1.4% | 34,726 |
| Bangladesh | 107,858 | 1.4% | 31,874 |
| Argentina | 100,638 | 1.3% | 35,217 |
Two distinct patterns: Netherlands and China show high req/IP (datacenter scraping from hosting providers). US, Brazil, Iraq, Bangladesh show low req/IP with massive IP fanout (residential proxy networks).
The cloud-deny-list myth (global abuser IPs by provider)
| Provider | IPs | % IPs |
|---|---|---|
| "other" (residential ISPs + small hosting) | 2,324,306 | 99.7% |
| DigitalOcean | 3,664 | 0.2% |
| AWS | 1,269 | 0.1% |
| Azure | 775 | 0.0% |
| GCP | 294 | 0.0% |
AWS + GCP + Azure + DigitalOcean combined: 0.3% of abuser IPs.
Less than 1% of abuser traffic comes from the four major cloud providers. "Other" is residential ISP addresses (Comcast, Spectrum, AT&T, BT, Vodafone, JIO, etc.) rented by proxy services (Bright Data, Oxylabs, IPRoyal, Smartproxy), plus smaller hosting providers (Hetzner, OVH, Worldstream, Contabo) that don't appear in big-cloud IP range lists.