Monday, 21 March 2016

Ex unibus plurum - "wuggen" revisited

In my last post I stumbled, in passing, on an idea; some of you may have noticed the "hmmm". I thought it might take the form of an update, but it has turned out to be a bit more substantial than that. The idea was to quantify the different ways English has of forming a plural. When I wrote this post nearly two years ago it didn't occur to me to wonder. I was content to say:
English has lots of ways of pluralizing a noun – no change (sheep, fish...), change -us to -i (radius → radii...), add -en (ox → oxen [or do something else involving '-en' {child → children, brother → brethren...}]), change -ex or -ix to -ices (matrix → matrices) etc, but by far the most common device is to add an s (though this simple idea hides several options [/s/ {rabbits}, /z/ {gardeners}, /ɪz/ {radishes}]). What is the word for 'more than one wug'? Wugs, of course, with /gz/.
It's fairly obvious to a native speaker that the most common way is to add an s. In fact, this rule becomes apparent whenever a young language learner mistakenly adds an s to an irregular plural – sheep becomes sheepS rather than sheep, for example, and when an adult corrects mouseS to mice, the compulsion to keep faith in the add-an-s rule is so strong that the next attempt is quite often miceS.
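
The three-way split hidden inside "add an s" follows the last sound of the singular: /ɪz/ after a sibilant, /s/ after any other voiceless consonant, and /z/ everywhere else. Here is a minimal Python sketch of that rule. It works on a final sound supplied by hand rather than on spelling, and the function name and the two sound sets are illustrative assumptions, not a full phonology of English.

```python
# Illustrative sound sets (assumptions, deliberately not exhaustive).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def plural_allomorph(final_sound: str) -> str:
    """Pick the spoken form of the regular plural ending for a given final sound."""
    if final_sound in SIBILANTS:
        return "ɪz"  # radish -> radishes
    if final_sound in VOICELESS:
        return "s"   # rabbit -> rabbits
    return "z"       # gardener -> gardeners, wug -> wugs


for word, last in [("rabbit", "t"), ("gardener", "ə"), ("radish", "ʃ"), ("wug", "g")]:
    print(f"{word}: /{plural_allomorph(last)}/")
```

The sibilant test has to come first, because /s/ and /ʃ/ are themselves voiceless and would otherwise be given a plain /s/.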

But I wondered how I could put a number on that. The obvious source of data seemed to me to be the British National Corpus, though it is relatively small, at a mere 100,000,000 words. Some of the publicly available corpuses...
<I_know_ I_know subject="corpora">
There are people who say that corpora is the "correct" plural; some readers may have had the misfortune of being taught by someone who believed so; Firefox is trying to correct my spelling. The latinate plural is not wrong, but I adhere to Fowler's belief (in The King's English)
 ...that all words not English in appearance are in English writing ugly and not pretty, and that they are justified only (1) if they afford much the shortest or clearest, if not the only way to the meaning ... or (2) if they have some special appropriateness of association or allusion in the sentence they stand in.
Elsewhere (maybe Modern English Usage) he gives the advice that you're less likely to make an embarrassing mistake (like mistaking a latinate -us word for a second declension noun instead of a 3rd [such as corpus] or 4th declension one [syllabus, for example], and giving it an -i ending), and more likely to be understood, if you use a native English s plural ending whenever it's possible.
</I_know_ I_know>
... have many more.

It is possible in principle to construct a query that requires a search engine to return all the nouns in a corpus that end with a certain string. But accompany me, if you will, in a thought experiment. Suppose for the sake of argument that in any text in the corpus the percentage of plural nouns is N%.
A few examples, followed by the count of plural nouns:
  1. The cat sat on the mat.                    N=0
  2. There is a tide in the affairs of men...   N=1
  3. Softly softly catchee monkey.              N=0
  4. The wages of sin is death.                 N=1
  5. When shall we three meet again?            N=0
  6. Honey I shrunk [sic] the kids              N=1
  7. Where have you been all day?               N=0
In this mini-corpus (perhaps I should make that nano-corpus) there are 3 plural nouns out of a total of 50 words. They're not too common, plural nouns; 6% in this case, though in, for example, a recipe book the figure would be much higher. In BNC, that would be 6% of 100,000,000, or 6,000,000.

This is admittedly a VERY dodgy sample; but my point is that even a tiny value for N leads to a big number in a corpus such as BNC.
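
For anyone who wants to try other values of N, the scaling can be written out as a short Python sketch. The inputs (3 plural nouns, 50 words, a corpus of 100,000,000 words) are simply the figures quoted in this post; the function name is a made-up illustration.

```python
def scale_to_corpus(plural_count: int, sample_words: int, corpus_words: int) -> tuple[float, int]:
    """Return (percentage of plural nouns in the sample, projected count in the corpus)."""
    # Multiply before dividing so that round figures stay exact.
    percentage = 100 * plural_count / sample_words
    projected = round(plural_count * corpus_words / sample_words)
    return percentage, projected


# Figures as quoted in the post, not re-measured here.
percentage, projected = scale_to_corpus(3, 50, 100_000_000)
print(f"N = {percentage:.0f}%, projected plural nouns: {projected:,}")
```

Even at N = 1 the same function gives a million plural nouns, which is the point of the thought experiment.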

At the British National Corpus I asked for all the plural nouns that end -s. This would catch a few non-standard plurals, like indices or theses; but those would add up to no more than dozens, or hundreds at most, among millions. But the query timed out after finding the first 7500 distinct words (the most common of all was things at 40,453 – a clear 11,000 ahead of the field), by which stage the search had only worked its way down to words that had a total of 27 hits. For comparison, in plural nouns ending -n, the search worked its way down to 23 (there were no words with 24, 25, 26 or 27 hits) after listing about 96% of all possible hits. Extrapolating from that we can estimate that if a search reaches 27 after 4.5M hits there will be a total of something like 5,000,000 (N = 5 – so my fag packet calculation wasn't too far out).
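
The extrapolation is just a division, sketched here in Python. The 4.5M hits and the 96% coverage are the figures from this post; the big assumption, flagged in the code as well, is that the -s plurals tail off in the same way as the -n plurals did. The function name is hypothetical.

```python
def estimate_total(hits_so_far: int, coverage: float) -> int:
    """Estimate a full total, given that hits_so_far is `coverage` (0 to 1) of it."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be greater than 0 and at most 1")
    return round(hits_so_far / coverage)


# ASSUMPTION: the 96% coverage seen for -n plurals also applies to -s plurals.
estimate = estimate_total(4_500_000, 0.96)
share_of_corpus = 100 * estimate / 100_000_000
print(f"Estimated -s plurals: {estimate:,} (about {share_of_corpus:.1f}% of the corpus)")
```

The division gives a little under 4.7 million, which this post rounds up to "something like 5,000,000"; if the long tail of rare -s plurals is fatter than the -n one, the true figure would be higher.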

I've crunched some numbers, thinking at first in terms of some pretty pie charts. But the difference between -s plurals and all the others was so great that pie charts wouldn't be very interesting: most non-s endings would get a tiny (often nearly invisible) sliver. I've shown my working here (none too legibly I'm afraid):

More legible version

In fact, rather than a pie chart, a more helpful image would be a clock-face. The sector occupied by all non-s plurals added together would be the area between 12 o'clock and about 4 minutes to. The only families of non-s plurals that would account for more than a minute or two would be irregular English plurals of all kinds (folk, men, children, feet, teeth...) and Latin plurals – mostly ending with -i, but sometimes ending with -a or -e; the few Latin plurals that end -ūs (in Latin, as for example syllabus does) are of course lost among the -s endings – if there are any in BNC.
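
Taking the clock-face literally, a sector translates into a share of the whole (and back again) as in this small Python sketch. The "about 4 minutes" is the figure from this post; the function names are illustrative, and the exact shares are in the linked spreadsheet rather than here.

```python
MINUTES_ON_CLOCK = 60


def minutes_to_share(minutes: float) -> float:
    """Fraction of the clock-face covered by a sector of the given width in minutes."""
    return minutes / MINUTES_ON_CLOCK


def share_to_minutes(share: float) -> float:
    """Width in minutes of the sector that represents the given fraction."""
    return share * MINUTES_ON_CLOCK


non_s_share = minutes_to_share(4)
print(f"Non-s plurals: roughly {non_s_share:.1%} of all plural nouns")
```

On that reading the -s plurals keep the other 56 minutes or so to themselves.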

There. There are some numbers. I may try a similar trick on another corpus; on the other hand I may get on with #WVGTbk2. 


Update 2016.03.21.14:30 – Added PS

PS Here's a clue:

Stubborn – gathering information on the way (12)

Update 2016.03.23.14:50 – PPS

Added link to spreadsheet.

Update 2016.04.25.11:35 – PPPS

PPPS Time. The answer to that clue: INTRANSIGENT

Update 2018.06.10.10:25 – A few typo fixes

Update 2018.10.23.14:05 – Updated linked spreadsheet (but left old screen-grab as is – I'm sure anyone who's interested in the figures will look in the spreadsheet anyway).
