Ex unibus plurum - "wuggen" revisited

In my last post I stumbled, in passing, on an idea; some of you may have noticed the "hmmm". I thought it might take the form of an update, but it has turned out to be a bit more substantial than that. The idea was to quantify the different ways English has of forming a plural. When I wrote this post nearly two years ago it didn't occur to me to wonder. I was content to say
English has lots of ways of pluralizing a noun – no change (sheep, fish...), change -us to -i (radius → radii...), add -en (ox → oxen [or do something else involving '-en' {child → children, brother → brethren...}]), change -ex or -ix to -ices (matrix → matrices) etc, but by far the most common device is to add an s (though this simple idea hides several options [/s/ {rabbits}, /z/ {gardeners}, /ɪz/ {radishes}]). What is the word for 'more than one wug'? Wugs, of course, with /gz/.
It's fairly obvious to a native speaker that the most common way is to add an s. In fact, this rule becomes apparent whenever a young language learner mistakenly adds an s to an irregular plural – sheep becomes sheepS rather than sheep, for example, and when an adult corrects mouseS to mice, the compulsion to keep faith in the add-an-s rule is so strong that the next attempt is quite often miceS.

But I wondered how I could put a number on that. The obvious source of data seemed to me to be the British National Corpus, though it is relatively small, at a mere 100,000,000 words. Some of the publicly available corpuses...
<I_know_ I_know subject="corpora">
There are people who say that corpora is the "correct" plural; some readers may have had the misfortune of being taught by someone who believed so; Firefox is trying to correct my spelling. The latinate plural is not wrong, but I adhere to Fowler's belief (in The King's English)
 ...that all words not English in appearance are in English writing ugly and not pretty, and that they are justified only (1) if they afford much the shortest or clearest, if not the only way to the meaning ... or (2) if they have some special appropriateness of association or allusion in the sentence they stand in.
Elsewhere (maybe Modern English Usage) he gives the advice that you're less likely to make an embarrassing mistake (like mistaking a latinate -us word for a second declension noun instead of a 3rd [such as corpus] or 4th declension one [syllabus, for example], and giving it an -i ending), and more likely to be understood, if you use a native English s plural ending whenever it's possible.
</I_know_ I_know>
... have many more.

It is possible in principle to construct a query that requires a search engine to return all the nouns in a corpus that end with a certain string. But accompany me, if you will, in a thought experiment. Suppose for the sake of argument that in any text in the corpus the percentage of plural nouns is N%.
A few examples, followed by the count of plural nouns:
  1. The cat sat on the mat.                     N=0
  2. There is a tide in the affairs of men...    N=1
  3. Softly softly catchee monkey.               N=0
  4. The wages of sin is death.                  N=1
  5. When shall we three meet again?             N=0
  6. Honey I shrunk [sic] the kids               N=1
  7. Where have you been all day?                N=0
In this mini-corpus (perhaps I should make that nano-corpus) there are 3 plural nouns out of a total of 50 words. They're not too common, plural nouns; 6% in this case, though in, for example, a recipe book the figure would be much higher. In BNC, that would be 6% of 100,000,000, or 6,000,000.

This is admittedly a VERY dodgy sample; but my point is that even a tiny value for N leads to a big number in a corpus such as BNC.
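For anyone who wants to play with the arithmetic, here is a minimal sketch in Python. The function name is my own invention, and the values 0.5 and 1 are purely illustrative; only the 6% figure and the 100,000,000-word corpus size come from the text above.

```python
BNC_WORDS = 100_000_000  # corpus size quoted in the text


def expected_plural_nouns(n_percent: float, corpus_words: int = BNC_WORDS) -> int:
    """Return N% of the corpus, rounded to a whole number of words."""
    return round(corpus_words * n_percent / 100)


# 6 is the nano-corpus figure; 0.5 and 1 are illustrative smaller values of N.
for n in (0.5, 1, 6):
    print(f"N = {n}%: about {expected_plural_nouns(n):,} plural nouns")
```

Even at half a percent the estimate is half a million plural nouns, which is the point: small percentages of a large corpus are still large numbers.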

At the British National Corpus I asked for all the plural nouns that end -s. This would catch a few non-standard plurals, like indices or theses; but those would add up to no more than dozens, or hundreds at most, among millions. But the query timed out after finding the first 7500 distinct words (the most common of all was things at 40,453 – a clear 11,000 ahead of the field), by which stage the search had only worked its way down to words that had a total of 27 hits. For comparison, in plural nouns ending -n, the search worked its way down to 23 (there was no 24, 25, 26 or 27) after listing about 96% of all possible hits. Extrapolating from that, we can estimate that if the -s search reaches 27 after 4.5M hits there will be a total of something like 5,000,000 (N = 5 – so my fag packet calculation wasn't too far out).
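The extrapolation is just a division: hits counted so far, divided by the share of all hits that the search is assumed to have covered. A rough sketch follows; the function name is hypothetical, the 4.5M and 96% figures are the ones quoted above, and the 90% figure is purely my assumption, included to show what level of coverage would give a round 5,000,000. Whether the -s search really behaves like the -n search at this depth is itself an assumption.

```python
def estimated_total(hits_so_far: float, coverage: float) -> int:
    """Scale a partial hit count up to an estimated total.

    coverage is the fraction (0 to 1) of all hits assumed to have been
    listed by the time the search stopped.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be greater than 0 and at most 1")
    return round(hits_so_far / coverage)


HITS_SO_FAR = 4_500_000  # -s hits listed before the query timed out

# Coverage borrowed from the -n search (about 96% listed by frequency 23).
print(f"{estimated_total(HITS_SO_FAR, 0.96):,}")

# Assumed lower coverage, since the -s search stopped earlier (at 27).
print(f"{estimated_total(HITS_SO_FAR, 0.90):,}")
```

The two estimates come out at roughly 4.7 million and 5 million, so "something like 5,000,000" is a fair round figure under either assumption.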

I've crunched some numbers, thinking at first in terms of some pretty pie charts. But the difference between -s plurals and all the others was so great that pie charts wouldn't be very interesting: most non-s endings would get a tiny (often nearly invisible) sliver. I've shown my working here (none too legibly I'm afraid):

More legible version

In fact, rather than a pie chart, a more helpful image would be a clock-face. The sector occupied by all non-s plurals added together would be the area between about 4 minutes to 12 and 12 o'clock. The only families of non-s plurals that would account for more than a minute or two would be irregular English plurals of all kinds (folk, men, children, feet, teeth...) and Latin plurals – mostly ending with -i, but sometimes ending with -a or -e; the few Latin plurals that end -ūs (in Latin, as for example syllabus does) are of course lost among the -s endings – if there are any in BNC.
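To translate between the clock-face picture and percentages, a small sketch like this will do. The function names are mine, and the 1% example is illustrative rather than a figure from the spreadsheet.

```python
MINUTES_ON_CLOCK = 60


def minutes_to_share(minutes: float) -> float:
    """Fraction of the whole clock-face covered by a sector of this many minutes."""
    return minutes / MINUTES_ON_CLOCK


def share_to_minutes(share: float) -> float:
    """Width in minutes of the sector for a given fraction of all plurals."""
    return share * MINUTES_ON_CLOCK


# The 4-minute sector for all non-s plurals, expressed as a percentage.
print(f"{minutes_to_share(4):.1%}")

# An illustrative family accounting for 1% of plurals, expressed in minutes.
print(f"{share_to_minutes(0.01):.1f} minutes")
```

So the four-minute sector corresponds to a little under 7% of all plurals, and a family with 1% of the total gets barely more than half a minute.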

There. There are some numbers. I may try a similar trick on another corpus; on the other hand I may get on with #WVGTbk2. 


Added link to spreadsheet.

Asking questions

Addenda agenda corrigenda memoranda propaganda pudenda...

The time has come, unfortunately, for the pointless, annoying, never-ending discussion about the plural of THE R WORD.

Let's take as our starting point The Speech of Cicero for Aulus Cluentius Habitus:

This referendum ad populum ["the putting of a question to the people"] was soon abridged to plain referendum; but the phrase shows that the word was, in Latin, a gerund. Now I'm not going to argue that English has to follow the rules of Latin. That ridiculous notion has long plagued studies of English. But to quote one distance learning site:
Forming the gerund: The gerund is formed much the same way... . All gerunds are considered neuter nouns and there is NO nominative case and NO plural form.
OK, there is no plural of referendum in Latin; so how do we form it in English? There is little doubt about how plurals are formed in English. In most cases (and I wonder how to quantify that "most"; hmmm) the rule is simple: add an s. Phonologically it's not quite that simple: depending on what's being pluralized, you add either /s/ or /z/ or /i:z/ or /ɪz/. But there are quite a few exceptions: sheep/sheep, man/men, ox/oxen, basis/bases...
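The regular part of that rule (leaving aside the exceptions, and the /i:z/ of words like bases) depends only on the last sound of the singular, and can be sketched in a few lines of Python. This is an illustration, not a full phonology: the function name and the two sets are my own, the final sounds are supplied by hand in IPA rather than worked out from spelling, and gardener is given its non-rhotic ending.

```python
# Final sounds (IPA) that select each regular plural ending.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}  # take /ɪz/
VOICELESS = {"p", "t", "k", "f", "θ"}          # take /s/


def regular_plural_ending(final_sound: str) -> str:
    """Return the regular plural ending for a noun ending in final_sound."""
    if final_sound in SIBILANTS:
        return "ɪz"
    if final_sound in VOICELESS:
        return "s"
    return "z"  # every other voiced sound, vowels included


# Hand-supplied final sounds (an assumption, not derived from the spelling).
examples = {"rabbit": "t", "gardener": "ə", "radish": "ʃ", "wug": "g"}
for word, sound in examples.items():
    print(f"{word}: /{regular_plural_ending(sound)}/")
```

The sibilant check has to come first, because /s/ and /ʃ/ are voiceless too and would otherwise be given a plain /s/.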

Then there are foreign borrowings: Latin – medium/media, Greek – criterion/criteria, Hebrew – seraph/seraphim... as many as the language has borrowed, and as many as will be borrowed... This gives many opportunities for linguistic snobbery:

My dear, did you hear that? 
"Criterions" – Where did HE go to school?

Naturally, in the face of this, hypercorrect forms such as criterias are common. People think they should use the foreign pluralizer, but the native one interferes. And sometimes a foreignified version becomes so commonly used that it becomes standardized. This seems to be what has happened to referendum. It wasn't until I started researching for this post that I came across this:

Well, as long as I live I'll keep saying referendums. But I'm afraid the feeling that "formal" contexts call for a parade of ignorance is gathering momentum.


I saw a programme in the Les Hommes de l'Ombre series last night, and it reconfirmed my belief. I have referred before to Gaston Dorren's Lingo. It's still on my Guilt Pile, but some day I'll finish it; and I have read the chapter on French, unfortunately called Mummy Dearest. Dorren's point is that the French language always has an eye on its mother, Latin. There are, of course, many Francophones who know no Latin; but his point is about the relation between the spoken and written language. When a French person says, for example, ils aiment, the -nt has no phonemic value. But in writing it resurfaces. Sometimes, one of these Latinate fossils reappears (resounds?) in speech, because of a phonological rule: il est aimable, for example.

extract from Lingo

Dorren's point could have been more carefully made (that chapter heading, for example). But there is a grain of truth in it: modern pronunciations in many Romance languages hark back to a Latin spelling; elsewhere I have mentioned Italian pronunciations of -ezzo:
...Italian native speakers pronounce mezzo with the voiced affricate /ʣ/ and prezzo with the unvoiced affricate /ʦ/ without – for the most part – knowing the reason: that the one with voicing is derived from MEDIU(M) and the one without voicing from PRETIU(M). Yet I've never heard a mezzo-soprano called (in English) a /meʣəʊ/. Of course I'm not saying the English pronunciation 'should' have the /ʣ/;  it's just interesting that it doesn't.
But French prends la galette, as it were, when it comes to harking back in this way: Latin is never far from the surface of French, and English has no equivalents of Augustan poets like Corneille and Racine. Pope comes close, but his classicism strikes me as more superficial.

Returning to that TV programme, one of the characters said "Je n'aime pas les référendums". When I heard this I was relieved to learn that French hasn't been infected by the rot of supposedly formal hypercorrection.

