I confess: I took Proverbit for a spin. It is a brilliant benchmark conceived by Enrico Mensa, one of Italy's leading experts in artificial intelligence applied to language (YouTube channel: Enkk). It was his idea to put language models under pressure by asking them to complete proverbs, and the result is hilarious. Spoiler: LLMs are good at recalling proverbs (it was verified that every model knew them before being put to the test), but ask them to complete one and they panic like a student facing a difficult Latin translation at the maturità, the Italian school-leaving exam. And the test is, apparently, simple: one hundred Italian proverbs, cut in half, and for each half four possible endings, crafted with scientific sadism. The options are: a) a nonsense rhyme (e.g. “Colly Core”), b) a synonym without rhyme, c) an inverted version of the proverb, d) a sensible but flat phrase. If none is correct, the answer must be “none”. That's it.
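To make the format concrete, here is a minimal sketch of how one test item and the final score could be represented. It is an assumption for illustration only, not Proverbit's actual code: the names `ProverbItem`, `NONE_ANSWER` and `accuracy` are hypothetical, and the item texts are placeholders rather than real benchmark data.

```python
from dataclasses import dataclass

# Hypothetical constant: the answer expected when no option is correct.
NONE_ANSWER = "none"


@dataclass
class ProverbItem:
    """One benchmark item (illustrative structure, not the real schema)."""
    stem: str      # first half of the proverb
    options: dict  # keys "a".."d" mapped to the four candidate endings
    correct: str   # "a".."d", or NONE_ANSWER if every option is wrong


def accuracy(items, answers):
    """Return the fraction of items whose answer matches the expected one."""
    if len(items) != len(answers):
        raise ValueError("one answer per item is required")
    hits = sum(
        1
        for item, answer in zip(items, answers)
        if answer.strip().lower() == item.correct
    )
    return hits / len(items)


# Placeholder items: the texts below are stand-ins, not real proverbs.
placeholder_options = {
    "a": "<nonsense rhyme>",
    "b": "<synonym without rhyme>",
    "c": "<inverted version>",
    "d": "<sensible but flat phrase>",
}
items = [
    ProverbItem("<first half 1>", placeholder_options, NONE_ANSWER),
    ProverbItem("<first half 2>", placeholder_options, NONE_ANSWER),
]

# A model that falls for the rhyme on the second item gets half the points.
print(accuracy(items, ["none", "a"]))  # 0.5
```

Nothing more is needed to grade a run: the whole difficulty sits in the options, not in the scoring.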
On paper it looks like a Settimana Enigmistica puzzle for five-year-olds; in practice it is a deadly trap for language models: confused by assonance, tempted by nonsense, seduced by tautology. The result? Answers that seem generated by a drunk with a thesaurus. To understand how an LLM really “reasons”, just look at what DeepSeek does when faced with a prompt like “A buon intenditor …” (“To one who understands well …”). Its process is based on a mix of linguistic similarity, statistical probability and semantic approximation.
It is not logic; call it “algorithmic intuition”. And the reasoning goes more or less like this (you can find it in full on the site): “Option A presents an assonance consistent with the original and a compatible metrical structure. Even if semantically absurd, its form follows the proverbial style more faithfully. Option B is semantically congruent (‘zero chatter’ as a variant of ‘a few words’), but lacks rhyme and rhythm, which reduces its probability as the expected completion. Option C inverts the meaning of the proverb and, while keeping the assonance (‘many words’), is conceptually opposite. If I can’t give ‘non-Rosica’, then A is the best surrogate.”
And that is only the beginning. DeepSeek keeps going, and Enrico reads out the entire reasoning in his video (you can see it here). It reaches the (wrong) solution by way of an epic, a Proustian novel in search of the lost proverb.
At a certain point it even tells itself: “Although option A is semantically inconsistent, it retains the phonetic closure properties typical of an established idiomatic expression in the Italian language, especially as regards the bilateral verb-noun structure and the final rhythmic symmetry.” In practice: it is stupid, but it sounds like a proverb, and for DeepSeek that is enough.
In short, an exhilarating experiment on artificial intelligence grappling with proverbs, with a bonus explanation of how an LLM works. At this point, while waiting for Enrico's next tests, I don't know, we might as well register “singing centers” among the official proverbs.