Size matters when it comes to corpora. At 220 million words of text, the corpus used to create the second edition of the COBUILD dictionary in 1995 was over ten times the size of the one used for the first edition, and 220 times the size of the first electronic corpora developed in the 1960s and early 1970s. Yet it was tiny compared to those we use today, some of which run to billions, not millions, of words.
To give an idea of the amount of information involved: suppose you are compiling a medium-frequency verb like proclaim. In the British National Corpus (BNC), which was fixed in 1993 at around 100 million words and has not been expanded since, there are just over 1,000 results (known as ‘citations’) for proclaim. It is possible, given time and the necessary expertise, to look at every one of these citations and give a good account of the word’s meanings and behaviour in a dictionary entry (or elsewhere).
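To make the idea of a citation search concrete, here is a minimal sketch in Python of how you might count citations for a word in a plain-text corpus. This is only an illustration of the principle, not how the BNC itself is queried; the file name corpus.txt and the decision to match inflected forms with a simple pattern are assumptions for the example.

```python
# A minimal sketch of a citation count, assuming a hypothetical plain-text
# corpus file (corpus.txt) with one sentence or concordance line per line.
import re

def citations(path, word):
    """Return every line of the corpus file containing the search word."""
    # \w* after the stem also catches simple inflected forms (a rough shortcut)
    pattern = re.compile(r"\b" + re.escape(word) + r"\w*", re.IGNORECASE)
    hits = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if pattern.search(line):
                hits.append(line.strip())
    return hits

lines = citations("corpus.txt", "proclaim")  # proclaim, proclaims, proclaimed, proclaiming
print(len(lines), "citations found")
```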
But what about a high-frequency word like take (174,000 citations in the BNC) or hand (just under 50,000)? In today’s huge corpora the numbers are far greater: the most frequent words have tens of millions of citations, while some (and, the, he, have and so on) number in the hundreds of millions. Even relatively uncommon words can have tens of thousands of citations.
Fortunately there are software tools and other methods for efficiently extracting the information that corpora hold. Modern corpus search software gives an overall picture of a word by displaying it on the screen in a way that shows how it combines with other words. It shows the search word together with its collocates – the words it combines with most frequently – and tells you how significant these combinations are. Each collocational or grammatical chunk displayed can be expanded, allowing you to examine it in more detail if necessary.
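As a rough sketch of what such software does behind the scenes, the following Python fragment counts which words appear near a search word and ranks them by a simple significance score. The window size, the minimum count and the use of pointwise mutual information (PMI) are assumptions made for the example; real corpus tools use more refined statistics and grammatical analysis.

```python
# A rough sketch of collocate extraction: count co-occurrences within a small
# window around the node word and rank them by PMI. Not the method used by
# any particular corpus tool; window, threshold and score are illustrative.
import math
import re
from collections import Counter

def collocates(text, node, window=4, min_count=3):
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(total, i + window + 1)):
                if j != i:
                    pair[tokens[j]] += 1
    scores = {}
    for w, c in pair.items():
        if c >= min_count:
            # PMI: how much more often w co-occurs with the node than chance predicts
            scores[w] = math.log2((c * total) / (freq[node] * freq[w]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# text = open("corpus.txt", encoding="utf-8").read()   # hypothetical corpus file
# for word, pmi in collocates(text, "hand")[:10]:
#     print(f"{word:15} {pmi:.2f}")
```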
The other essential tool in a lexicographer’s armoury is sampling. It was one of the insights of COBUILD’s founder, Professor John Sinclair, that you can tell a great deal about a word’s meanings and behaviour from a small representative sample of corpus citations: in many cases a screenful or two is enough. So an overview of a word’s collocational and grammatical behaviour, combined with a closer look at a small sample of lines, generally provides enough information to compile a new entry or revise an existing one.
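A sampling step is straightforward to illustrate. The sketch below, which assumes the list of citation lines produced by the earlier example, simply draws a random "screenful" of them; the sample size of 25 is an arbitrary choice for illustration.

```python
# A minimal illustration of the sampling idea: read a random screenful of
# concordance lines rather than every citation. Assumes `lines` is a list of
# citation strings, as in the earlier sketch; the size of 25 is arbitrary.
import random

def screenful(lines, size=25, seed=None):
    """Return a random sample of citations, at most one screen's worth."""
    rng = random.Random(seed)
    return rng.sample(lines, min(size, len(lines)))

# for citation in screenful(lines, size=25):
#     print(citation)
```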
It is worth remembering that corpora could not have expanded to their current enormous size without fast computer connections. In my early days as a freelancer, when dial-up was the only type of connection widely available, you could literally start a corpus search for a frequent word, go away and make a cup of tea, and come back to find the search still running. Today, with high-speed broadband, a search even for a very frequent word returns a result within a few seconds.
Corpora are used today in many different ways, for different purposes and on different dictionary projects. At its most basic, a corpus can provide authentic examples of how a word is used. At the other end of the scale, detailed corpus analysis continues to reveal new and surprising information about the collocational and grammatical behaviour of even the most familiar words. As new ways of using language come into being, a regularly updated corpus allows us to keep track of them. While the ways in which corpora are built and used have changed greatly over the past thirty years, it has become more or less unthinkable to compile or revise a dictionary without reference to the evidence provided by a corpus.
This blogpost was written by Liz Potter, a freelance lexicographer, editor and translator.
Find out more about our new editions of the Collins COBUILD dictionaries and other COBUILD materials here.