
FaceDeer

@FaceDeer@fedia.io

Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.

Spent many years on Reddit and then some time on kbin.social.


FaceDeer ,

Maybe it's "simple as that" if you're just expressing an opinion, but what's the legal basis for it?

FaceDeer ,

The GDPR says that information that has been anonymized, for example through statistical analysis, is fine. LLM training is essentially a form of statistical analysis. There's hardly anything in law that is "simple."

FaceDeer ,

You could say it's to "circumvent" the law or you could say it's to comply with the law. As long as the PII is gone what's the problem?

FaceDeer ,

It is impossible for them to contain more than just random fragments; the models are too small for the training data to be compressed enough to fit. Even the fragments that have been found are not exact, since the AI is "lossy" and hallucinates.

The examples that have been found are examples of overfitting, a flaw in training where the same data gets fed into the training process hundreds or thousands of times over. This is something that modern AI training goes to great lengths to avoid.

FaceDeer ,

You don't think LLMs are being trained off of this content too? Nobody needs to bother "announcing a deal" for it, it's being freely broadcast.

FaceDeer ,

Buddy, I just want to type a search term and get results.

Telemetry can help them do better at providing that. Devs aren't magical beings, they don't know what's working and what's not unless someone tells them.

FaceDeer ,

No, this analogy would make more sense if it was a matter of recording a large number of interactions between customers and tellers to ensure that the window isn't interfering with their interactions. Is the window the right size? Can the customer and teller hear each other through it? Is that little hole at the bottom large enough to let through the things they need to physically exchange? If you deploy the windows and then never gather any telemetry you have no idea whether it's working well or if it could be improved.

FaceDeer ,

The analogy isn't perfect, no analogy ever is.

In this case the content of the search is all that really matters for the quality of the search. What else would you suggest be recorded, the words-per-minute typing speed, the font size? If they want to improve the search system they need to know how it's working, and that involves recording the searches.

It's anonymized and you can opt out. Go ahead and opt out. There'll still be enough telemetry for them to do their work.

FaceDeer ,

Fortunately it doesn't have to be exactly like the real thing to be useful. Just ask machine learning scientists.

FaceDeer ,

The time limit is a century or so, so that's something our descendants can figure out.

FaceDeer ,

So, Microsoft recognized and responded to all the complaints by removing the feature that people were objecting to.

Resulting headline: "Microsoft is trying to hide the evidence that they were thinking of doing that thing we hated! Hate them harder!"

Do people want companies to just ignore complaints completely because there's no way to satisfy anyone anyway?

FaceDeer ,

And that's what Microsoft has apparently done in this case, yet it's being spun negatively anyway.

5e has an advantage of not requiring doctorate in quantum physics to run ( ttrpg.network )

I would usually be sad to see another original RPG go 5e compatible but Neuroshima was an infamously poorly designed ruleset, possibly worse than Shadowrun. I probably won't be running it, but may steal statblocks for my 5e game if I need weird stuff again.

FaceDeer ,

And they also dual-licensed most of the 5e SRD under Creative Commons as part of the "oh crap we didn't expect everyone to be mad enough to actually hurt our bottom line" walk-back from the OGL debacle.

FaceDeer ,

Too late, there's a little blood in the water so now everyone hates Microsoft and is pouncing on every drop they think they smell. Being part of an angry mob is fun!

Don't worry, in a couple of weeks or months some other big company or rich person will become the focus and everyone will forget about Microsoft again.

FaceDeer ,

I've been saying this for years, this was an incredibly boneheaded move by the Internet Archive and they just keep on doubling down on it. They shouldn't have done it in the first place. When they got sued, they should have immediately admitted they screwed up and settled - the publishers would probably have been fine with a token punishment and a promise to shut down their ebook library, it's not like IA cost them anything significant. But they just keep on fighting, and it's only making things worse.

This isn't even IA's purpose in the first place! They archive the Internet. They're like a guy who's caring for a precious baby who decides he should go poke a bear with a stick, and when the bear didn't respond at first he whacked it over the nose with the stick instead. Now the bear's got his leg and he's screaming "oh no, protect my baby!" And it's entirely his fault the baby's in danger.

FaceDeer ,

Indeed, which is why I'm furious at the Internet Archive's leadership for merrily dancing out into a minefield completely unbidden.

FaceDeer ,

You realize that if cases like this are won then only the "giant fucking corporations" are going to be able to afford the datasets to train AI with?

FaceDeer ,

I don't think you're familiar with the sort of resources necessary to train a useful LLM up from scratch. Individuals won't have access to that for personal use.

FaceDeer ,

They're the ones training "base" models. There are a lot of smaller base models floating around these days with open weights that individuals can fine-tune, but they can't start from scratch.

What legislation like this would do is essentially let the biggest players pull the ladders up behind them - they've got their big models trained already, but nobody else will be able to afford to follow in their footsteps. The big established players will be locked in at the top by legal fiat.

All this aside from the conceptual flaws of such legislation. You'd be effectively outlawing people from analyzing data that's publicly available to anyone with eyes. There's no basic difference between training an LLM off of a website and indexing it for a search engine, for example. Both of them look at public data and build up a model based on an analysis of it. Neither makes a copy of the data itself, so existing copyright laws don't prohibit it. People arguing for outlawing LLM training are arguing to dramatically expand the concept of copyright in a dangerous new direction it's never covered before.

FaceDeer ,

> But you're claiming that there's already no ladder. Your previous paragraph was about how nobody but the big players can actually start from scratch.

Adding cost only makes the threshold higher. The opposite of the way things should be going.

> > All this aside from the conceptual flaws of such legislation. You'd be effectively outlawing people from analyzing data that's publicly available

> How? This is a copyright suit.

Yes, and I'm saying that it shouldn't be. Analyzing data isn't covered by copyright, only copying data is covered by copyright. Training an AI on data isn't copying it. Copyright should have no hold here.

> Like I said in my last comment, the gathering of the data isn't in contention. That's still perfectly legal and anyone can do it. The suit is about the use of that data in a paid product.

That's the opposite of what copyright is for, though. Copyright is all about who can copy the data. One could try to sue some of these training operations for having made unauthorized copies of stuff, such as the situation with BookCorpus (a collection of ebooks that many LLMs have trained on that is basically pirated). But even in that case the thing that is a copyright violation is not the training of the LLM itself, it's the distribution of BookCorpus. And one detail of piracy that the big copyright holders don't like to talk about is that generally speaking downloading pirated material isn't the illegal part, it's uploading it, so even there an LLM trainer might be able to use BookCorpus. It's whoever it is that gave them the copy of BookCorpus that's in trouble.

Once you have a copy of some data, even if it's copyrighted, there's no further restriction on what you can do with that data in the privacy of your own home. You can read it. You can mulch it up and make paper mache sculptures out of it. You can search-and-replace the main character's name with your own, and insert paragraphs with creepy stuff. Copyright is only concerned with you distributing copies of it. LLM training is not doing that.

If you want to expand copyright in such a way that rights-holders can tell you what analysis you can and cannot subject their works to, that's a completely new thing and it's going down a really weird and dark path for IP.

FaceDeer ,

Oh, for crying out loud, Internet Archive. This is not the fight you should be fighting.

The Internet Archive is the steward of an incredibly valuable repository of archived information. Much of what it's got squirrelled away is likely unique, irreplaceable historical records of things that have otherwise been lost. And they're risking all of that in this quixotic battle to share books that are widely available anyway and not at all at risk.

"Lending" out those books in the way that they did was a blatant copyright violation, spitting directly into the eye of publishers known to be litigious and vindictive. All to fight for a point that's not part of their mandate, which is archiving the Internet. They're going to lose and it's going to hurt them badly.

> Each copy can only be loaned to one person at a time, to mimic the lending attributes of physical books.

> Internet Archive believes that its approach falls under fair use but publishers Hachette, HarperCollins, John Wiley, and Penguin Random House disagree. They filed a lawsuit in 2020 equating IA’s controlled digital lending operation to copyright infringement.

That is not what the lawsuit was about, Internet Archive. If you're going to fight this fight then be honest about what exactly you're fighting for. The lawsuit in 2020 was not about one-person-at-a-time lending, it was about your "COVID Emergency Library" where you removed all restrictions and let people download books freely.

I strongly believe that copyright has gone berserk over the decades and grown like an uncontrolled weed, harming the intellectual commons for the sake of megacorporations' profits. I'm a subscriber on this piracy community, after all. I believe in the position that Internet Archive is fighting for here, despite all the downvotes I'm surely about to be hammered with. But they shouldn't be the ones fighting it. Let someone else take this one on. Sci-Hub or Library Genesis, maybe.

FaceDeer ,

I expect them to not provoke the $200 billion lawsuit in the first place. They should never have done the "Emergency Library", it was an obvious boneheaded decision.

Then, once they had done it and the inevitable lawsuit came down on them, they should have tried to settle the lawsuit. Not fight to the bitter end, not double down. They're only making it worse for themselves. It's not simply losing the lawsuit that could destroy them, it's refusing to negotiate.

FaceDeer ,

> The emergency library followed the same legal framework that ebook lending follows at local libraries.

No, it did not. From the Wikipedia article:

> On March 24, 2020, as a result of shutdowns caused by the COVID-19 pandemic, the Internet Archive opened the National Emergency Library, removing the waitlists used in Open Library and expanding access to these books for all readers.

Emphasis added. They took the limits off.

What the libraries do is already in a legal grey area, the publishers just don't go after it because it's more trouble than it's worth and would bring bad press. Like how most rightsholders ignore fanfiction. But the IA went way beyond that and smacked them in the face.

> Don't blame IA for fulfilling their mission to make knowledge free.

Their mission is archiving the Internet, a mission that they are putting at risk with this stunt.

FaceDeer ,

As I said, the "traditional" CDLs were also in a legal grey area. But once the publishers are suing IA for going full Library Genesis anyway, why not also include those?

I went back to one of the older articles I could find on this subject, from before the lawsuit was filed. Some particularly-relevant excerpts:

> Until this week, the Open Library only allowed people to "check out" as many copies as the library owned. If you wanted to read a book but all copies were already checked out by other patrons, you had to join a waiting list for that book—just like you would at a physical library.

> Of course, such restrictions are artificial when you're distributing digital files. Earlier this week, with libraries closing around the world, the Internet Archive announced a major change: it is temporarily getting rid of these waiting lists.

...

> James Grimmelmann, a legal scholar at Cornell University, told Ars that the legal status of this kind of lending is far from clear—even if a library limits its lending to the number of books it has in stock. He wasn't able to name any legal cases involving people "lending" digital copies of books the way the Internet Archive was doing.

...

> The legal basis for the Open Library's lending program may be even shakier now that the Internet Archive has removed limits on the number of books people can borrow. The benefits of this expanded lending during a pandemic are obvious. But it's not clear if that makes a difference under copyright law. "There is no specific pandemic exception" in copyright law, Grimmelmann told Ars.

Ironically the FAQ that Internet Archive put online has been taken down, but I found it in their Wayback Machine. It says:

> The library will have suspended waitlists through June 30, 2020, or the end of the US national emergency, whichever is later. After that, waitlists will be dramatically reduced to their normal capacity, which is based on the number of physical copies in Open Libraries.

So it seems pretty clear to me that by "suspending waitlists" it means that they're going to "lend" more copies simultaneously than they actually have.
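To make the difference concrete, here's a minimal sketch of the two lending modes as I understand them from the excerpts above. It's purely illustrative: `LendingPool`, `checkout`, `give_back`, and the `enforce_waitlist` flag are hypothetical names I made up, not anything from the Internet Archive's actual software.

```python
from collections import deque


class LendingPool:
    """Toy model of one title in a digital lending library (hypothetical)."""

    def __init__(self, owned_copies: int, enforce_waitlist: bool = True):
        self.owned_copies = owned_copies          # physical copies the library owns
        self.enforce_waitlist = enforce_waitlist  # False models "suspended waitlists"
        self.borrowers: set[str] = set()
        self.waitlist: deque[str] = deque()

    def checkout(self, patron: str) -> bool:
        """Lend a copy, or queue the patron if every owned copy is already out."""
        if self.enforce_waitlist and len(self.borrowers) >= self.owned_copies:
            self.waitlist.append(patron)
            return False
        self.borrowers.add(patron)
        return True

    def give_back(self, patron: str) -> None:
        """Return a copy and hand it to the next waiting patron, if any."""
        self.borrowers.discard(patron)
        if self.waitlist and len(self.borrowers) < self.owned_copies:
            self.borrowers.add(self.waitlist.popleft())
```

With `enforce_waitlist=True` the number of simultaneous loans can never exceed `owned_copies`. With it set to `False`, every checkout succeeds, so loans can outnumber the copies owned, which is the situation the FAQ wording seems to describe.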

The Internet Archive had been poking a bear with a stick for years and the bear had been grumbling but not otherwise responding. So they decided to try giving it a whack across the nose with the stick instead. Normally I'd just sigh and shake my head at their stupidity, but they're carrying a precious cargo on their back while they're needlessly provoking that bear, and now they're screaming "oh no my precious cargo! Help me!" while the bear has a firm grip on their leg. That makes me extra frustrated and angry at them for doing this.

I'm not siding with the bear here, I should be very clear. The publishers are awful, the whole concept of copyright has become corrupt and broken, and so on and so forth. But the Internet Archive isn't supposed to be fighting this fight. They were supposed to be protecting that precious cargo, and provoking the bear is the opposite of doing that.

FaceDeer ,

Have you ever tried Bing Chat? It does that. LLMs that do websearches and make use of the results are pretty common now.

FaceDeer ,

Yes, but it shows how an LLM can combine its own AI with information taken from web searches.

The question I'm responding to was:

> I wonder why nobody seems capable of making a LLM that knows how to do research and cite real sources.

And Bing Chat is one example of exactly that. It's not perfect, but I wasn't claiming it was. Only that it was an example of what the commenter was asking about.

As you pointed out, when it makes mistakes you can check them by following the citations it has provided.

FaceDeer ,

Why do skilled professionals have less-skilled assistants?

FaceDeer ,

Over the past month I feel like all I've been doing is writing tech design documents for systems I don't actually know anything about because I haven't had the opportunity to go in and do anything with them.

Fortunately I've finally managed to reach the point where everyone agrees that we should just start implementing the basics and see how that goes rather than try to plan it all out ahead of time since we're surely going to have to throw out the later plans once we see what we're actually dealing with.

FaceDeer ,

Did you read the article? The popup warns users about it, yes. It's a good thing to let them know there won't be more security updates for their OS.

FaceDeer ,

How is that Microsoft's fault? Should they be forcing users to care, somehow? The warning is already getting people angry as it is.

Don’t Be Fooled: Much “AI” is Just Outsourcing, Redux ( www.techpolicy.press )

The promise of AI, for corporations and investors, is that companies can increase profits and productivity by slashing their reliance upon a skilled human workforce. But as this story and many others show, AI is just today’s buzzword for “outsourcing,” and it comes with the same problems that have plagued outsourced...

FaceDeer ,

AI is actually real, though, and can actually accomplish many of the things it's being used for. I think this article is focusing overly much on a couple of weird outlier situations.

FaceDeer ,

You are perhaps confusing the highly-general term "AI" for the more-specific term "AGI". It's true that there's no real AGI out there yet, but AI has been around for many decades. LLMs are a type of AI.

FaceDeer ,

Sure, but the fact that some "AI" isn't really AI doesn't mean AI isn't real. I run local LLMs on my home computer to perform various tasks, I can shut off my Internet entirely and they still work. There isn't some secret line out to a third-world sweatshop where outsourced labor is frantically typing responses to the thousands of queries generated by my scripts.

Training an AI is expensive and time consuming, but simply using one can be very straightforward.

FaceDeer ,

You're still talking about training AIs, though. Using AIs doesn't require years of work and PhDs to research. You just sign a contract with one of the AI service providers and they give you an API. You may need to do a little scripting to hook up a front end and some fiddling with prompts and parameters to get the AI to respond correctly, but as I said above, I've done this myself in my own home. Entirely on my own, entirely just for fun. It's really not hard, I could point you to a couple of links for some free software you could use to do it yourself. Heck, even the training part isn't hard if you're starting with one of the existing open models and you've got the hardware for it.

Do you really think all those companies out there with chatbot "help staff" (that speak perfect English and respond faster than a well-trained typist could type) are most likely just outsourcing the work to some cheap foreign company? What are the hundreds of billions of dollars' worth of computer hardware the AI service providers are running actually being used for, if not that?

FaceDeer ,

Going back to my original comment:

> Sure, but the fact that some "AI" isn't really AI doesn't mean AI isn't real.

The fact that Amazon was faking it in this one instance doesn't poof all the actual AI out of existence. There are plenty of off-the-shelf AI models that are good enough for various particular problems, they can go ahead and use them. You said it yourself, the chatbot "help staff" might be actual LLMs.

> At that point you might just try to figure out how to offload the work to someone else.

As I said, most companies using AI will likely be hiring professional AI service providers for it. That's where those hundreds of billions of dollars I mentioned above are going, where all the PhDs spending years on R&D are working.

FaceDeer ,

Users "owning" their content in that way would be the instant death of the Fediverse. If anyone can put whatever nonsense license terms they want on each individual comment or post, how could that chaos possibly be federated?

A better approach would be to recognize that if you're posting your words up on a giant billboard you're not going to be able to control who sees them.

FaceDeer ,

Plenty of non-capitalist organizations also seek to grow.

FaceDeer ,

Bing Chat has become my go-to search engine for situations where I'm not looking for a specific website or other such resource, and instead want some kind of information or knowledge. It does a websearch in the background, puts the results into its hidden context, and then builds an answer for you based on the information it dredged up, complete with links. You can then clarify your question or ask for further details and get a back-and-forth going; it's really handy. I'd recommend giving it a shot, I believe it works without needing an account now.

Oh, I should note: don't use it like an old-school search engine where you just type a couple of keywords in. Be conversational and give context to your search. Say for example "I'm planting a garden in Wichita, Kansas. What climate zone is it, and what sorts of flowers grow well there?" And then perhaps follow that with "Are any of those attractive to hummingbirds?" Or whatever. That should help it figure out what information to look for and how to distill what you want to know from it.
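For anyone curious, the search-then-answer loop described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how Bing Chat is actually implemented: `TOY_INDEX`, `search`, `build_prompt`, and `answer` are all hypothetical names, the "search engine" is a keyword match over three placeholder snippets, and the language model is whatever `generate` callable you pass in.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    snippet: str


# Hypothetical stand-in for a real web search backend (placeholder text only).
TOY_INDEX = [
    SearchResult("https://example.com/wichita-zone",
                 "Notes on the climate zone for Wichita, Kansas."),
    SearchResult("https://example.com/kansas-flowers",
                 "Flowers that grow well in Kansas gardens."),
    SearchResult("https://example.com/sourdough",
                 "Sourdough starter needs regular feeding."),
]


def _words(text: str) -> set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,?!").lower() for w in text.split()}


def search(query: str, limit: int = 2) -> list[SearchResult]:
    """Rank toy documents by how many query words appear in the snippet."""
    query_words = _words(query)
    scored = []
    for result in TOY_INDEX:
        overlap = len(query_words & _words(result.snippet))
        if overlap:
            scored.append((overlap, result))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [result for _, result in scored[:limit]]


def build_prompt(question: str, results: list[SearchResult]) -> str:
    """Put the search results into the (hidden) context ahead of the question."""
    lines = ["Answer using only the numbered sources below and cite them like [1]."]
    for number, result in enumerate(results, start=1):
        lines.append(f"[{number}] {result.url}: {result.snippet}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def answer(question: str, generate) -> tuple[str, list[str]]:
    """Search, build the prompt, and return the model's text plus source links."""
    results = search(question)
    reply = generate(build_prompt(question, results))
    return reply, [result.url for result in results]
```

The point of returning the URLs alongside the reply is the same one made above: when the model gets something wrong, you can follow the links and check.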

FaceDeer ,

I can't recall the last time I pirated anything executable (games and other software). There are legitimate free options for everything I've wanted, and executable code is just too risky.

FaceDeer ,

There's an assumption in the comments that this is Lemmy-specific, so I figured I should also mention a tool I used recently when copying subscriptions from a kbin instance to an mbin instance.

FaceDeer ,

Well, at least we've moved from "Meta is Satan! Defederate!!1!" to "They may mean well now but they'll turn evil later."

FaceDeer ,

I'm not a frequent user myself so I'm probably not the best to answer on the usability front, but for the combination of high TPS and low price volatility I'd probably recommend using one of Ethereum's stabletokens (DAI, USDT, etc.) on one of its layer-2 networks (such as Arbitrum or Optimism). Stabletokens are cryptocurrencies whose value has been tied to some external measure, in most cases the US Dollar, so they're ideal for use in commerce.

FaceDeer ,

I wouldn't recommend using Bitcoin specifically for an e-commerce site like this, both because of its volatility and its high transaction fees. A stabletoken like DAI or USDT is more specifically designed for this use case.

FaceDeer ,

No, you misunderstand the point of a stabletoken. They are designed to have a fixed value, usually tied to the US dollar. Is the US dollar a "casino chip"? That's what these sites usually price things in to begin with.

FaceDeer ,

Then isn't Gumroad already selling goods in exchange for "casino chips?" What alternative would you suggest?

FaceDeer ,

Currently it's handling about 140 TPS. The Dencun upgrade to Ethereum that just went live a couple of days ago adds a new feature, data blobs, that lets this go significantly higher at reduced cost.

FaceDeer ,

Ah, we're doing one of those full circle things. I actually remember the time when AOL was "the internet."
