12 Apr 2026 22 min read

AI in week 15

Een mooi boek over lesgeven, nakijken met LLMs en bias

Achter de schermen bij het NOS Journaal!

Goedemorgen en een speciaal welkom voor alle nieuwe volgers die mij deze week hebben zien spreken in den lande, ik was veel op pad om het verhaal van een diversere informaticawereld te vertellen en dat beviel goed (maar was ook wel tamelijk vermoeiend). De week begon natuurlijk al bijzonder omdat ik op maandag in het NOS Journaal zat, en het bleek dat heel erg veel mensen daar (nog?) naar kijken! Oude media is nog niet dood, blijkt.

Komende week ga ik genieten van een rustige week met maar één lezing. Kan ik weer eens lezén, anders droogt mijn voorraad van stukkies om te delen op een dag nog op! (English online again!)

Een cadeautje voor jou (en mij!)

Jan-Jaap Hubeek (misschien ken je hem nog van zijn podcast waarin ik te gast was) heeft een boek uitgebracht, en het is schitterend! Ik mocht het al lezen, omdat ik het voorwoord mocht schrijven, en dit is daar een stukje uit:

[E]r zijn aan alle kanten van het spectrum wetenschappers die zeggen dat ze precies weten hoe het moet, maar echt lesgeven is een voortdurend schipperen, zoeken, afstemmen en jezelf tegenkomen.

Hoe mooi dan ook dat dit boek zegt geen antwoord te zijn, maar precies die zoektocht. "Ik ga je niet vertellen", zegt Jan-Jaap Hubeek, "wat de waarheid is, of hoe het moet".

In de huidige wereld verliezen we steeds pedagogische ruimte, en dit boek is een hele fijne plek om te reflecteren op die ruimte en weerstand te bieden aan waar de systemen waarin we leven ons de kans ontnemen om leerlingen iets te bieden dat niet kan in de huidige wereld.

Lees dus dat boek als je voor de klas staat! En wat is leuker dan gratis? Ik mag van Jan-Jaap vijf boeken weggeven! Wil je er eentje hebben? Dat kan! Bij zijn boekpresentatie vroeg Jan-Jaap iets superleuks dus dat ga ik ook doen: schrijf een "ikje". Zijn instructie:

een kort, persoonlijk stukje over iets wat je meemaakte. Geen betoog, geen analyse, gewoon een eerlijk beeld van een moment dat je bijbleef. Iets wat je zag, hoorde of voelde, en dat ergens aan raakte.

Stuur me per mail of met een reply op deze mail jouw leukste ikje over AI of over het onderwijs of heel wat anders, en de leukste krijgen het boek gratis en voor niks thuisgestuurd. Met dank aan Jan-Jaap, die geeft ze weg, ik stuur ze alleen op!

AI en nakijken

Begin deze week was ik dus op het NOS journaal om het te hebben over nakijken met AI. "Is dat niet gewoon onschuldig?" vroeg de presentatrice (in een stukje dat het fragment niet gehaald heeft, helaas), en mijn antwoord is dan natuurlijk nee, nee, nee, dat is zo onschuldig niet. Nakijken is namelijk een zinnige en leuke taak. Mij geeft het lol om te zien dat leerlingen iets laten zien dat ik heb uitgelegd, en ik leer ervan als ze allemaal iets juist niet snappen. En, je kan ook beargumenteren, zoals ik vaker deed, dat leerlingen en studenten het recht hebben om door jou als docent gezien te worden. Je baan is om ze te zien, zeker op school, en ook nakijken is een vorm van aandacht.

Maar ja, het kost zoveel tijd, mopperen docenten dan. Op de keper beschouwd is dit een raar argument, immers, bijna alle docenten zijn gewoon in loondienst. Dus dat doe je in uren waarvoor je betaald wordt, voor je kerntaak: onderwijs. Krijg je als enige je werk niet af in de tijd die daarvoor staat? Dan doe je het misschien niet goed... En krijgt een hele beroepsgroep hun werk niet af in de tijd die daarvoor staat (daar lijkt het natuurlijk meer op)? Dan is er misschien een systemisch probleem dat we aan zouden kunnen pakken! Iedereen in loondienst die klaagt dat iets veel werk is, is in de basis iemand die klaagt over zijn werkgever en niet over de taak: "Je hebt geen hekel aan nakijken, je hebt een hekel aan werkdruk". En dat snap ik best! Ik heb maar twee klassen, ik heb de tijd om het leuk te vinden. Heb je 10 klassen keer 30 leerlingen in een toetsweek, 4 keer per jaar, dan is het al snel niet zo leuk meer. Maar dat ligt aan de organisatie, niet aan de taak. Dus laten we niet vergeten dat het een keuze is, om het zo te organiseren.

8 tot 9 uur zijn docenten "kwijt" aan nakijkwerk, schreef Dominique Sluijsmans op Linkedin in een hele fijne post over de zin en onzin van nakijken, voornamelijk buiten werktijd! Dat kan je ook gewoon laten, denk ik dan, doe het niet. Ik riep in de podcast van NIVOZ al eerder op tot een punctualiteitsstaking. Dat zou het probleem eens blootleggen en hopelijk onderwerp van gesprek maken.

En misschien zou het het nog waard zijn hè en heb je dit over voor je leerlingen, als het dan tenminste zinnig zou zijn! Maar, zegt Sluijsmans:

Opvallend is dat er weinig overtuigend bewijs bestaat dat intensief nakijken - als onderdeel van een toetsproces - daadwerkelijk leidt tot betere leerresultaten. Daarmee ontstaat een spanning tussen de tijdsinvestering en de opbrengst voor leerlingen.

Dus we doen een taak die ons zoveel werkdruk oplevert, en die taak helpt niet eens zeker weten! Die taak met AI doen, kost veel geld, en levert niet eens leerwinst op, omdat nakijken an sich dat niet doet. What a loss!

Zouden we dan niet beter wat anders kunnen doen? Fast feedback, schriften innemen en die beoordelen of onaangekondigde flitsoverhoringen houden! Dat is allemaal minder werk en dat zet leerlingen misschien ook aan tot iedere week een beetje doen in plaats van crammen wat helemaal zo effectief niet is. Dat systeem mag best eens onder de loep, in plaats van binnen de kaders te blijven worstelen.

Adam Boxer schrijft in een blog die Sluijsmans aanhaalt:

Teachers are not expected to mark students day-to-day work, but they are expected to closely monitor students as they practise and intervene where necessary.

Ja! Een (menselijk) oog houden op wat leerlingen leren ipv alleen maar toetsen. (Voor meer Sluijsmans, luister ook deze podcast!)

En zelfs in het huidige systeem zouden we een hoop anders kunnen doen, waarom hebben scholen geen nakijkdagen waarop je wordt ingeroosterd om op school na te kijken? Gezellig, pizzaatje erbij! Dan hoeft het in ieder geval niet thuis of in het weekend. En we kunnen voor de leerlingen dan vast iets anders moois verzinnen om te doen op die dagen, de school opruimen, een toneelstuk opvoeren, een dag maatschappelijke stage. En waarom moeten leraren eigenlijk per se zelf surveilleren in de toetsweek? Kunnen we geen mensen uit de buurt belonen met een boekenbon, zodat leraren in de toetsweek meteen kunnen nakijken? Redelijk simpele oplossingen die makkelijker en goedkoper zijn dan weer een softwareleverancier betalen van mijn belastinggeld. Of waarom mogen leerlingen uit de bovenbouw de onderbouw niet nakijken, dat is nog eens leerzaam om die foutjes te zien en de stof nog eens door te nemen! Dan kan de docent dát even nalopen. En wie nu zegt: maar Felienne... dat kan toch niet, dat zou niet betrouwbaar zijn, die kunnen foutjes maken of bevooroordeeld zijn! Tsja die kan ik alleen maar belonen met een dikke 10 voor kritisch denken!

En laten we ook nog even over tweedeordeeffecten nadenken. Stel dat we alles snel (en veilig en eerlijk etc) met AI zouden kunnen nakijken, wat dan? Dan zou je zo maar eens tot het besluit kunnen komen om meer en niet minder te gaan toetsen! Maar zijn leerlingen dan beter af? Leren ze dan meer, betekenisvoller en liever? Vaak zijn oplossingen van welke soort dan ook, ook een manier om niet meer over de problemen te praten, ook als de oplossingen blijken flut te zijn.

Er is nu nakijkstress en dus snap ik dat de verleiding, zeker voor jonge docenten, heel groot kan zijn om even snel met vriend chat wat na te kijken; ik heb al leraren in opleiding in de klas gehad die zeiden dat wel eens te doen, maar... dat mag dus niet zomaar van de AI Act! Daarin worden nakijksystemen als een hoogrisicotoepassing bestempeld:

AI systems intended to be used to evaluate learning outcomes, including when those outcomes are used to steer the learning process of natural persons in educational and vocational training institutions at all levels; (Annex III 3b)

En dat betekent dan weer dat er een human in the loop moet zijn:

High-risk AI systems shall be designed and developed in such a way, including with appropriate human-machine interface tools, that they can be effectively overseen by natural persons during the period in which they are in use (Artikel 14.1)

Dus wie dit uit efficiency-overwegingen doet, die komt van een koude kermis thuis, want uiteraard is controleren net zo veel werk, of soms meer, dan zelf doen! Eerlijk is eerlijk, dat zei AI-fan Martin Bakker ook in het NOS stuk, dat zijn werk er niet efficiënter van is geworden! Dus ja, waarom doen ze het dan...? Ik denk omdat docenten die dit graag gebruiken het gewoon supercool vinden, en ook dat snap ik prima. Het is ook cool, heel veel mensen hadden nooit verwacht dat computers zo op mensen zouden kunnen gaan lijken qua uitvoer.

Maar juist daarin schuilt het gevaar, het lijkt zo goed, bijna perfect. Waarom zou je dan controleren, gaan mensen dat echt doen? Zoals ik al eerder schreef, het is saai en vermoeiend, en docenten zonder jarenlange ervaring zullen er nog meer tijd voor nodig hebben (die weten immers niet zo snel waar de bekende fouten zitten) en leren er waarschijnlijk ook nog minder van.

Bias in de feedback

En dan als laatste nog maar weer eens zorgen over bias in nakijksystemen!! Dat die in algoritmes zit is bekend, en ook een nieuw paper werpt daarop weer licht, in de context van nakijken! Vraag je aan een LLM feedback op het werk van een leerling, dan doet die dat op basis van persoonskarakteristieken heel anders. Onderzoekers van Stanford voerden vier taalmodellen 600 essays van studenten, steeds met andere informatie in de prompts, bijvoorbeeld "Dit is werk van een zwarte leerling" of "Deze leerling is een high performer". Ze keken naar allerlei van zulke factoren: ras, gender, cijfers, absentie en socio-economische status. En wat er toen gebeurde zal je totaal niet verrassen...:

Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes—even when essay content was identical.

Alle stereotypen komen langs: tekst met een prompt dat het van een meisje is, krijgt eerder persoonlijke feedback (bijv: "ik vond jouw verhaal heel mooi") en als antwoord op teksten van (zogenaamd) niet-witte kinderen:

[The LLMs] generated feedback assumed limited language abilities in English, over-explaining rules of the language and recommending stylistic changes to make their writing sound "polished".

Bijna alle kinderen die niet op de traditionele overachievers lijken, kregen meer (holle) praise:

Further, compared to their counterparts, students identified as Black, Hispanic, Asian, female, unmotivated, and learning-disabled received less constructive criticism and more praise, reflecting both feedback withholding and positive feedback biases.

Zo leren die leerlingen dus minder van de feedback, omdat focus op taal en loze complimentjes ze niet helpen beter te worden: "attention to imagined language gaps came at the expense of comments on ideas, argument structure, evidence, and reasoning".

And yes yes, mensen hebben ook een bias, maar denk in ieder geval maar niet dat algoritmes (omdat ze toevallig niet van vlees en bloed zijn), die niet hebben. Maar relevanter nog, deze bias uit de algoritmische feedback halen hoort, als je het goed zou doen, ook bij het controleren dat docenten zouden moeten doen als ze zo'n feedbacktool gebruiken, en dat wordt 1) ongeveer meteen ondoenlijk en 2) je bent dan toch aan deze bevooroordeelde tekst blootgesteld! Dat wis je niet zomaar uit, zoals je ook vorige week kon lezen. En wie zegt dat een leraar nooit zulke informatie van een leerling in zou voeren, dat soort slimme prompt engineering ("geef feedback aan deze leerling die tot de beste van de klas behoort" of juist "geef feedback aan deze NT2-leerling") is nu juist het soort tips dat je van AI-fans krijgt.

Mijn oprechte vraag aan de AI-fans van de wereld is hoe we een tool die zich zo gedraagt goed moeten gebruiken, ik weet het echt niet! Hoe moeten we het beter prompten zodat de bias eruit gaat? Dat kan fundamenteel niet, het hele idee van een LLM is dat het een gemiddelde-machine is.

Heb je zin om hier echt nog dieper in te duiken? Kijk dan dit fijne interview met (oa) Olivia Guest en Iris van Rooij van de Radboud. Het hele stuk is sterk, maar het begin is echt te gek, als Guest uitlegt hoe de vraag of iets zou kunnen werken totaal niet relevant is ("possibilities don't bake bread"). Het is net, zegt ze, als teer-filters op sigaretten vroeger, die werden door de industrie geroemd omdat ze de sigaretten gezonder zouden maken. Maar de vraag of het filter van Marlboro beter is dan dat van Camel doet er totaal niet toe, zelfs, zegt ze met haar altijd enorm scherpe blik, als het waar is. Die vraag is irrelevant geworden omdat we zo ondertussen allemaal weten dat sigaretten slecht zijn, ook met een microgram minder teer in je bakkes.

Een andere goede vergelijking maakt Ketan Joshi in New Republic tussen AI en fossiele brandstof. Werkt benzine? Ja, het laat je auto rijden, maar die vraag is te smal:

In fact, fossil fuels “work,” but they also murder their end users, both through air pollution that poisons people and by stimulating the rapid overheating of earth's life support systems. They “work” right up until the moment they don't

En zo'n moment is er nu met het sluiten van de Straat van Hormuz. Opeens voelen we het hele systeem om olie heen, van distributie tot verwerking, dat anders totaal aan ons zicht onttrokken is.

En dan nog even dit

Groot stuk in Vrij Nederland over de risico's van AI waarin ik ook sta
En nog een hele mooie podcast van HUMAN over René Descartes waarvoor ik ook geïnterviewd ben

Events voor in de agenda!

Donderdagavond 21 mei op de VU mag ik reflecteren op de Abraham Kuyper lezing van Daan Heerma van Voss en u kunt zich nog aanmelden om niet alleen mij en Daan te horen maar ook het VU kamerkoor!

Slecht nieuws

Diep onderzoek van onderzoekers van (oa) Harvard, die hebben gekeken naar de chatberichten van 19 gebruikers die zelf aangeven psychologische problemen te hebben gekregen na te veel chatten met een LLM, bijna 400.000 berichten in een kleine 5000 gesprekken. De onderzoekers schrijven:

A common pattern we noticed was the chatbot rephrasing and extrapolating something the user said to validate and affirm them, while telling them they are unique and that their thoughts or actions have grand implications.

Heel herkenbaar voor wie een chatbot gebruikt, denk ik, het altijd meebewegen en aanmoedigen. Chat vindt zelfs jouw scheetjes leuk klinken!

En dan minder lollig, deze afgrijselijke visualisatie van ProPublica over wat er gaat gebeuren als mensen geen vaccins meer nemen.

Ziektes die al lang nog maar heel weinig voorkomen, zoals Hib, ik had er nog nooit van gehoord, zijn terug van weggeweest. En ik ben niet de enige, dokters kennen deze ziektes vaak ook niet (in het echt):

Prior to the vaccine [...] about 1,000 children died each year.
After vaccinations began, the number of Hib infections dropped to fewer than 50 a year. Many doctors who've trained in the past 40 years have never seen a case.

In zichzelf al afschuwelijk genoeg, maar denk ook nog eens aan alle tweedeorde-effecten. Alleen al het feit dat deze onderzoekers en zoveel anderen nu hieraan moeten werken, aan wat we zagen als een voldongen feit, zoveel verloren energie en tijd die we aan nieuwe dingen zouden kunnen besteden! Maar ook al het bijkomende trauma, de (groot)ouders, broertjes, zusjes, vriendjes en leraren die kinderen gaan verliezen. En waarom?

En wie denkt dat het hier beter gaat met het vertrouwen in de wetenschap...? Nou nee, ook hier loopt de vaccinatiegraad terug, maar ook bij het afhandelen van aardbevingsschade in Groningen wordt er niet naar de wetenschap geluisterd, schrijft De Correspondent in een hele lange longread, terwijl er juist honderdduizenden euro's in dat onderzoek gestoken waren:

De maatschappelijke organisaties wantrouwden de wetenschappelijke methode.
[...] Maar, noteerden de ambtenaren, ‘als er gekozen wordt voor een technische afbakening, zullen de maatschappelijke partijen en waarschijnlijk ook enkele regionale bestuurders geen steun geven aan het protocol.'

Met andere woorden, we kunnen er wel aan rekenen, maar de conclusies doen er niet toe. Onderzoek na onderzoek leidde tot dezelfde conclusie ("Die dekselse natuurkunde was alweer niet veranderd.") maar toch moet er geld uitgekeerd worden omdat het niet wordt geloofd. Wat een verloren geld, tijd en frustratie. Wat hadden we allemaal nog meer kunnen doen in die tijd?

Goed nieuws

Fijn nieuws, helaas vergezeld van een (denk ik?) AI afbeelding, bibliotheken gaan met hun tijd mee:

In ongeveer twintig jaar hebben vestigingen hun stoffige imago weten af te schudden om tevoorschijn te komen als levendige plekken van ontmoeting, voorlichting en educatie.

En, eigenlijk nog van vorige week maar te fijn om niet even aan te stippen!! ABP verkoopt zijn aandelen Palantir, meldt FD.

In de gemeenteraden zijn in Nederland twee weken terug maar liefst 504 vrouwen met voorkeursstemmen verkozen, aldus nu.nl. Wat een geniaal idee en groot succes is Stem op een Vrouw toch!

Fijne week en geniet van je boterham!

English

Good morning! I was speaking a lot this week, sharing the story of a more diverse world of computer science, and I really enjoyed it (though it was also pretty exhausting). The week got off to a special start, as I was on the Dutch national news on Monday, and it turns out that a lot of people (still?) watch that! Old media isn't dead yet!

A gift for you (and me!)

Jan-Jaap Hubeek (you might remember him from his podcast where I was a guest) has published a book, and it's brilliant! I got to read it early because I wrote the foreword, and here's an excerpt:

There are scholars on all sides of the spectrum who claim to know exactly how it should be done, but real teaching is a constant balancing act, a search, a process of fine-tuning, and a journey of self-discovery.
How wonderful, then, that this book claims not to be an answer, but precisely that search. "I'm not going to tell you," says Jan-Jaap Hubeek, "what the truth is, or how it should be done."
In today's world, we're constantly losing pedagogical space, and this book is a wonderful place to reflect on that space and resist the systems in which we live that rob us of the chance to offer students something that isn't possible in today's world. (translated from Dutch)

So read that book if you're a teacher! And what's better than free? Jan-Jaap is letting me give away five books! Want one? You can! At his book launch, Jan-Jaap asked for something really fun, so I'm going to do the same: write an "ikje" ("a little I" in Dutch). His instructions:

A short, personal piece about something you experienced. No argument, no analysis—just an honest account of a moment that stuck with you. Something you saw, heard, or felt, and that touched you in some way.

Send me your best ikje about AI, education, or something else entirely by email (or by replying to this email), and the best ones will receive the book, sent to your home for free. You can absolutely participate in English, but the book is in Dutch and I will only ship within the Netherlands.

AI and grading

So earlier this week, I was on the NOS news to talk about grading with AI. "Isn't that just harmless?" the presenter asked (in a segment that didn't make the final cut, unfortunately), and my answer is, of course, no, no, no—it's not harmless. Grading is, after all, a meaningful and enjoyable task. It gives me joy to see students demonstrate something I've explained, and I learn from it when they all fail to grasp a particular concept.

And, you could also argue, as I often did, that pupils and students have the right to be seen by you as a teacher. Your job is to see them, especially at school, and grading is also a form of attention.

But it takes so much time, teachers grumble. When you really think about it, this is a strange argument; after all, almost all teachers are salaried employees. So you do it in hours you are paid for, as part of your core task: teaching. Are you the only one who can't finish your work in the allotted time? Then maybe you're not doing it right... And if an entire profession can't finish their work within the allotted time (which, of course, seems more likely)? Then perhaps there's a systemic problem we could address! Anyone in salaried employment who complains that something is too much work is, at its core, someone complaining about their employer and not about the task: "You don't hate grading; you hate the workload." And I totally get that! I only have two classes; I have the time to enjoy it. If you have 10 classes with 30 students each during a testing week, four times a year, it quickly stops being fun. But that's down to the organization, not the task itself. So let's not forget that it's a choice to organize it this way.

Teachers "lose" 8 to 9 hours to grading, mostly outside working hours, wrote Dominique Sluijsmans on LinkedIn in a really great post about the sense and nonsense of grading! You could also simply not do that, I think: just don't. I already called for a work-to-rule strike in the NIVOZ podcast earlier. That would expose the problem and hopefully make it a topic of discussion. (Both links in Dutch.)

And maybe it would even be worth it, and you'd be willing to do this for your students—if it were at least meaningful! But, says Sluijsmans:

It is striking that there is little convincing evidence that intensive grading—as part of an assessment process—actually leads to better learning outcomes. This creates a tension between the time investment and the return for students.

So we're doing a task that creates so much workload for us, and we're not even sure that task helps! Doing that task with AI costs a lot of money and doesn't even yield learning gains, because grading in and of itself doesn't do that. What a waste!

Wouldn't it be better to do something else? Provide fast feedback, collect and grade notebooks, or hold unannounced pop quizzes! All of that involves less work and might also encourage students to do a little bit each week instead of cramming, which isn't all that effective anyway. That system could certainly use a closer look, rather than continuing to struggle within the existing framework.

Adam Boxer writes in a blog that Sluijsmans cites:

Teachers are not expected to mark students' day-to-day work, but they are expected to closely monitor students as they practise and intervene where necessary.

Yes! Keeping a (human) eye on what students are learning instead of just testing them. (For more from Sluijsmans, check out this podcast in Dutch though!)

And even within the current system, we could do a lot of things differently. Why don't schools have grading days on which you're scheduled to grade at school? Cozy, with some pizza! Then at least it wouldn't have to be done at home or on the weekend. And we could surely come up with something else nice for the students to do on those days: clean up the school, put on a play, or do a day of community service. And why do teachers necessarily have to proctor the exams themselves during exam week? Can't we reward people from the neighborhood with a book voucher, so teachers can start grading right away during exam week? Pretty simple solutions that are easier and cheaper than paying yet another software vendor with my tax money. Or why can't upper-grade students grade the lower grades' work? Seeing those mistakes and going over the material again is educational for them! Then the teacher can just check that over. And to anyone who says: "But Felienne… that's not possible, it wouldn't be reliable, they might make mistakes or be biased!" Well, I can only reward them with a solid A+ for critical thinking!

And let's also take a moment to think about second-order effects. Suppose we could grade everything quickly (and safely and fairly, etc.) with AI—what then? You might just decide to test more, not less! But would students be better off? Would they learn more, in a more meaningful and enjoyable way? Often, solutions of any kind are also a way to stop talking about the problems, even if the solutions turn out to be rubbish.

There's grading stress right now, so I get that the temptation, especially for young teachers, can be really strong to quickly grade something with our friend Chat; I've had student teachers in my class who said they sometimes do that, but… the AI Act doesn't simply allow that! It classifies grading systems as high-risk applications:

AI systems intended to be used to evaluate learning outcomes, including when those outcomes are used to steer the learning process of natural persons in educational and vocational training institutions at all levels; (Annex III 3b)

And that, in turn, means a human must be involved:

High-risk AI systems shall be designed and developed in such a way, including with appropriate human-machine interface tools, that they can be effectively overseen by natural persons during the period in which they are in use (Article 14.1)

So anyone doing this for efficiency reasons is in for a rude awakening, because of course checking is just as much work as doing it yourself, or sometimes more! To be fair, AI enthusiast Martin Bakker also said in the NOS piece that his work hasn't become any more efficient because of it! So yeah, why do they do it then...? I think because teachers who like to use this just think it's super cool, and I totally get that. It is cool; a lot of people never expected computers to be able to resemble humans so much in terms of output.

But that's precisely where the danger lies—it seems so good, almost perfect. Why would you check it then? Will people actually do that? As I wrote earlier, it's boring and exhausting, and teachers without years of experience will need even more time for it (after all, they don't know right away where the common errors are) and will likely learn even less from it.

Bias in the feedback

And finally, yet another round of concerns about bias in grading systems!! It's well known that bias exists in algorithms, and a new paper sheds light on this once again, in the context of grading! If you ask an LLM for feedback on a student's work, it will do so very differently based on personal characteristics. Researchers at Stanford fed four language models 600 student essays, each time with different information in the prompts, such as "This is the work of a Black student" or "This student is a high performer." They examined all sorts of factors: race, gender, grades, absenteeism, and socioeconomic status. And what happened next won't surprise you at all…:

Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes—even when essay content was identical.

All the stereotypes come into play: text with a prompt indicating it's from a girl is more likely to receive personal feedback (e.g., "I really liked your story"), and in response to texts from (supposedly) non-white children:

[The LLMs] generated feedback assumed limited language abilities in English, over-explaining rules of the language and recommending stylistic changes to make their writing sound "polished".

Almost all children who were not traditional overachievers received more (empty) praise:

Further, compared to their counterparts, students identified as Black, Hispanic, Asian, female, unmotivated, and learning-disabled received less constructive criticism and more praise, reflecting both feedback withholding and positive feedback biases.

So these students learn less from the feedback because a focus on language and empty compliments doesn't help them improve: "attention to imagined language gaps came at the expense of comments on ideas, argument structure, evidence, and reasoning."

And yes, yes, people have biases too, but don't think for a moment that algorithms don't have them just because they happen not to be flesh and blood. But even more relevant: removing this bias from algorithmic feedback is, if you do it properly, also part of the checking that teachers should do when they use such a feedback tool, and that 1) becomes practically impossible almost immediately and 2) still leaves you exposed to the biased text! You can't just erase that, as you could read last week. And who says a teacher would never enter such information about a student? That kind of clever prompt engineering ("give feedback to this student who is among the best in the class" or, conversely, "give feedback to this NT2 student", i.e. a student learning Dutch as a second language) is precisely the kind of tip you get from AI enthusiasts.

My sincere question to the AI fans of the world is: how are we supposed to use a tool that behaves this way well? I really don't know! How should we prompt it so that the bias goes away? That's fundamentally impossible, because the whole idea of an LLM is that it's an averaging machine.

Do you feel like diving even deeper into this? Then check out this great interview with (among others) Olivia Guest and Iris van Rooij from Radboud University. The whole piece is strong, but the beginning is truly fantastic, when Guest explains how the question of whether something might work is completely irrelevant ("possibilities don't bake bread").

It's just like, she says, tar filters on cigarettes back in the day, which were touted by the industry as making cigarettes healthier. But whether Marlboro's tar filter is better than Camel's doesn't matter at all—even if it's true, she says with her always razor-sharp insight. That question has become irrelevant because we all know by now that cigarettes are bad for you, even with a microgram less tar in your mouth.

Ketan Joshi makes another good comparison in The New Republic between AI and fossil fuels. Does gasoline work? Yes, it makes your car run, but that question is too narrow:

In fact, fossil fuels “work,” but they also murder their end users, both through air pollution that poisons people and by stimulating the rapid overheating of Earth's life support systems. They “work” right up until the moment they don't

And such a moment has arrived with the closure of the Strait of Hormuz. Suddenly, we feel the entire system surrounding oil—from distribution to processing—which is otherwise completely hidden from our view.

Bad news

Researchers from (among others) Harvard did an in-depth study of the chat messages of 19 users who report having developed psychological problems after chatting too much with an LLM: nearly 400,000 messages across just under 5,000 conversations. The researchers write:

A common pattern we noticed was the chatbot rephrasing and extrapolating something the user said to validate and affirm them, while telling them they are unique and that their thoughts or actions have grand implications.

Very recognizable for anyone who uses a chatbot, I think—always going along with you and encouraging you. The chatbot even thinks your farts sound nice!

And then, less funny, this horrifying visualization from ProPublica about what will happen if people stop getting vaccines.

Diseases that have been very rare for a long time, like Hib (I had never even heard of it), are back. And I'm not the only one: doctors often don't know these diseases either, at least not from real life:

Prior to the vaccine [...] about 1,000 children died each year. After vaccinations began, the number of Hib infections dropped to fewer than 50 a year. Many doctors who've trained in the past 40 years have never seen a case.

Horrible enough in itself, but also think of all the second-order effects. The mere fact that these researchers and so many others now have to work on this, on something we considered settled, means so much lost energy and time that we could have spent on new things! But also all the additional trauma: the parents, grandparents, brothers, sisters, friends, and teachers who are going to lose children. And why?

And anyone who thinks trust in science is doing better here...? Well, no: vaccination rates are declining here too, and in the handling of earthquake damage in Groningen science is not being listened to either, writes De Correspondent in a very long read, even though hundreds of thousands of euros had been put into that research:

Civil society organizations distrusted the scientific method. [...] However, the officials noted, ‘if a technical demarcation is chosen, civil society groups and likely some regional administrators will not support the protocol.' (translated)

In other words, we can do the calculations, but the conclusions don't matter. Study after study led to the same conclusion ("That pesky physics had, once again, not changed.") yet money still has to be paid out because the conclusion isn't believed. What a waste of money and time, and what a lot of frustration. What else could we have done in that time?

Good news

Nice news, unfortunately accompanied by an (I think?) AI-generated image: libraries are moving with the times:

In about twenty years, branches have managed to shake off their dusty image to emerge as vibrant places for meeting, information, and education.

And, actually from last week, but too good not to mention!! The Netherlands' biggest pension fund, ABP, is selling its Palantir shares, reports FD.

In local elections in the Netherlands two weeks ago, no fewer than 504 women were elected via preferential votes, according to nu.nl. "Stem op een vrouw" is such a brilliant idea!

Have a great week and enjoy your sandwich!