An Insight into Corpus: Identifying New Words and Meanings

27/09/24

Neologisms and the constant evolution of language fascinate dictionary editors and users alike. For lexicographers, corpora are essential to the way in which new words are identified, monitored and recorded.

Let’s start with a little bit of background about the use of corpora in lexicography: as one of the very first modern corpus linguists, Professor John Sinclair led the COBUILD project that resulted in the creation of the largest corpus of English language texts in the world. Professor Sinclair was also instrumental in developing the tools needed to analyse the data, enabling lexicographers to find out how people really use the English language and to develop new ways of structuring dictionary entries. This led to the very first edition of the COBUILD Dictionary being published in 1987, the first of a new generation of dictionaries based on evidence of how English is really used. Under his guidance, Professor Sinclair’s team also developed a full-sentence defining style, which not only gave the user the sense of a word, but also showed that word in grammatical context. The COBUILD dictionary therefore became the first dictionary based on real examples of English – the type of English that people speak and write every day.

Today, the Collins Corpus is a vast database made up of over 20 billion words from the English language. It is regularly updated with data from a range of sources, including fiction, non-fiction, newspapers, websites and social media. The Collins Corpus also includes data from around the world (for example, the United States, Australia, South Africa and Canada) to show how English is used by native speakers globally. The size, scope and up-to-date content of the corpus therefore offer a crucial insight into how language is used today.

The text in the corpus is presented to our lexicographers as concordances, that is, individual lines of text from different sources, each showing the word in question with a little of its surrounding context. You don’t see a full extract and you can’t read a whole text! Editors can then arrange the lines in different ways depending on what they want to look at. They can analyse meaning and usage, frequency, and collocation of the word in question. This helps them to identify new meanings or parts of speech which might be worth further investigation.
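For readers who like to see the idea in code, here is a minimal sketch of how concordance lines of this kind could be produced and rearranged. It is not the software used with the Collins Corpus: the function names `concordance` and `sort_by_right_context`, the 30-character context width, the crude handling of inflections and the four sample sentences (adapted from the ‘stream’ examples in this post) are all assumptions made for illustration.

```python
import re

def concordance(texts, keyword, width=30):
    """Return (left context, matched word, right context) for each hit of `keyword`."""
    # Assumption: a crude pattern that also catches simple inflections
    # (stream, streams, streamed, streaming). Real corpus tools use lemmatisation.
    pattern = re.compile(rf"\b{re.escape(keyword)}(?:s|ed|ing)?\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Pad both sides so the matched word lines up in a column when printed.
            lines.append((left.rjust(width), match.group(0), right.ljust(width)))
    return lines

def sort_by_right_context(lines):
    """Arrange lines so that similar following words sit next to each other."""
    return sorted(lines, key=lambda line: line[2].strip().lower())

# Invented sample sentences, not real corpus data.
sample_texts = [
    "We camped beside a mountain stream, surrounded by wildflowers.",
    "Tears streamed down his face as he spoke.",
    "The company uses software to stream video feeds to customers.",
    "Many viewers now prefer streaming services to broadcast television.",
]

for left, node, right in sort_by_right_context(concordance(sample_texts, "stream")):
    print(f"{left} [{node}] {right}")
```

Sorting on the text to the right of the word is just one possible arrangement; sorting on the left context instead would group lines by whatever precedes the word, such as ‘mountain’.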

Let’s look at a few concordances for ‘stream’.

As you can see, there are various parts of speech and meanings of ‘stream’ shown here:

As a noun:

  1. meaning ‘a small narrow river’ eg a mountain stream, surrounded by wildflowers
  2. meaning ‘a steady flow’ eg a stream of insults; the stream of pilgrims; a new stream of work
  3. in a compound noun jet stream
  4. in the internet meaning eg the live internet stream

As a verb:

  1. meaning ‘to flow’ eg tears streamed down his face
  2. in the internet meaning eg to stream meetings online; software used to stream video feeds; streamed live on the internet

Plus, there is evidence of the participle ‘streaming’ being used as a modifier, for example, streaming services.

Even this very small snippet of concordances of ‘stream’ gives lexicographers more than enough evidence to look more closely at the meanings which have come into use in the field of internet technology. Analysis of this newer usage will allow editors to compile dictionary entries to reflect what has been found in the corpus:

 

In addition to concordances, we can use a feature of the corpus to look at words that collocate closely, and this also helps us to identify new meanings. For ‘takeaway’ below, the word ‘key’ is of interest as it doesn’t apply to the known meaning of ‘takeaway’ (= hot cooked food that can be bought at a restaurant and eaten elsewhere).
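The basic idea of looking for collocates can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the corpus feature described above: the function name `collocates`, the two-word window, the tiny `STOPWORDS` list and the sample sentences are all invented for the example, and real corpus tools generally rank collocates with a statistical association measure rather than the raw counts used here.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stoplist so that grammatical words
# do not swamp the results in this toy example.
STOPWORDS = {"the", "a", "from", "on", "was", "that", "her", "one"}

def collocates(texts, node, window=2):
    """Count words found within `window` tokens either side of `node`."""
    counts = Counter()
    for text in texts:
        # Lower-case word tokens; internal hyphens are kept together.
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
        for i, token in enumerate(tokens):
            if token == node:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(w for w in neighbours if w not in STOPWORDS)
    return counts

# Invented sample sentences, not real corpus data.
sample_texts = [
    "The key takeaway from the report is that costs are rising.",
    "We ordered a Chinese takeaway on Friday night.",
    "One key takeaway from the meeting was the new deadline.",
    "Her key takeaway was that planning matters.",
]

counts = collocates(sample_texts, "takeaway")
print(counts.most_common(3))
```

In this toy data ‘key’ turns up beside ‘takeaway’ more often than any food-related word, which is the kind of signal that would prompt a closer look at the concordances.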

 

Our lexicographers can then examine this collocation further in the corpus and pin down its meaning. Here are some of the many concordances of ‘key takeaway’. It is clear from these extracts that it relates to something that has been learned or gleaned from a discussion or report.

This is then reflected in the dictionary as:

In addition to new meanings, new parts of speech also emerge as part of our corpus analysis: in the case below, the adjective ‘weird’ is showing as a verb with a 3rd person singular ending ‘s’:

In fact, a closer look shows that it is indeed a phrasal verb – ‘weird out’, subsequently reflected in the dictionary as:

And, of course, neologisms themselves appear through the regular updating of corpus data. Our programs constantly compare old and new data to pick out completely new words that are coming into the language so that we can monitor these usages. The concordances below show the use of ‘all-hands’ as a noun and as a modifier:
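As a rough illustration of what comparing old and new data can mean, the sketch below flags words that occur in newer text but never in older text. It is a simplification built on assumptions rather than a description of our actual programs: the function names `tokenize` and `new_word_candidates`, the `min_count` threshold and the sample sentences are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case words, keeping hyphenated forms such as 'all-hands' whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def new_word_candidates(old_texts, new_texts, min_count=2):
    """Return words seen at least `min_count` times in new data and never in old data."""
    old_vocab = {tok for text in old_texts for tok in tokenize(text)}
    new_counts = Counter(tok for text in new_texts for tok in tokenize(text))
    return {
        word: count
        for word, count in new_counts.items()
        if word not in old_vocab and count >= min_count
    }

# Invented sample sentences, not real corpus data.
old_texts = [
    "The manager called a meeting for all staff.",
    "Everyone attended the meeting.",
]
new_texts = [
    "The manager called an all-hands for Friday.",
    "Everyone attended the all-hands meeting.",
]

print(new_word_candidates(old_texts, new_texts))
```

The threshold matters: a word seen only once in new data may be a typo or a one-off coinage, so a candidate list like this is a starting point for monitoring, not a decision to add an entry.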

However, it’s not just editors and lexicographers who have a say in what can be added to the dictionary. On the Collins Dictionary website, users are encouraged to submit words that they have come across or use that are not already shown in the dictionary. Some of the words that have been suggested recently include hybrid training, streaming farm, malinformation, and AuDHD.

These submissions are then reviewed and monitored by our team with a view to adding them to the dictionary should they meet our usual inclusion criteria. There is something quite satisfying about seeing a word that you have suggested make it into the dictionary!

To submit a word, go to https://www.collinsdictionary.com/submission/. You will be asked to register, which is a simple and quick process. We look forward to seeing your suggestions!

Heraclitus is quoted as saying, ‘The only thing that is constant is change’, and that is certainly true of language. With tools like the Collins Corpus at our disposal, we can attempt to track and pin down some of these changes, but it is a never-ending, though never thankless, task!

This blog post was written by Maree Airlie. Maree is a lexicographer working in the Languages Team for HarperCollinsPublishers. She is also a qualified primary teacher in the UK.