Building the Cambridge Learner Corpus: Challenges and Solutions

2020-04-03 Knight

Huawen Jiaoxue yu Yanjiu (TCSOL Studies), 2020, Issue 1

Ben Knight

(Language and Education Research Centre, Cambridge University Press, Cambridge CB2 8BS, UK)

1. Corpus development at Cambridge University Press

Cambridge University Press has been using corpora of English for nearly thirty years. The initial reason for developing a corpus was to inform dictionary development, and corpus analysis continues to be a major part of developing, maintaining and publishing dictionaries. However, Cambridge has extended the use of corpora to other areas of language resources, including course books and supplementary learning materials. This paper describes the range of corpora used by Cambridge, focusing in particular on one of them, the Cambridge Learner Corpus. It discusses the challenges that this work faced and how they have been met, and it concentrates on the practical aspects of applied corpus research rather than the theoretical issues that underpin it.

The label ‘Cambridge English Corpus’ is used to describe the collection of corpora that we have developed to give us insights into the use and learning of English. Within that, we group some subcorpora together as the Cambridge Reference Corpus and others as the Cambridge Learner Corpus. The Cambridge Reference Corpus is itself a collection of corpora that include texts (spoken and written) from ‘expert users’. Expert users include people whose first language is English as well as people who use English regularly for their work or public activity. This includes, for example, those whose first language is not English but who have published articles in English, give presentations in English, or participate in meetings or discussions in English. Within the corpus, texts are often marked as British English or American English, though this categorization may sometimes apply more to the context in which they were written, published or spoken than to the identity of the speaker or writer.

2. The Cambridge Reference Corpus

The purpose of the Cambridge Reference Corpus is to provide a reliable picture of how English is really being used. The challenge has been to ensure a wide and balanced sampling of language used in different contexts. It is possible to build very large corpora by taking large numbers of texts from the internet in an automated process (see Sharoff (2006) for discussion of the different methods of, and issues with, creating corpora through web crawling). However, ensuring that the corpus has an appropriate balance between different contexts, and sufficient quantity for particular areas of interest, takes considerable planning, monitoring and maintenance.

The Cambridge Reference Corpus currently contains over 2 billion words, and includes both spoken and written English, collected over the past twenty years. Written English includes newspaper articles, fiction and non-fiction books, magazines and journals, website text, letters, emails, etc. Spoken English includes social and business conversations, meetings, interviews, speeches, presentations, phone calls, and TV and radio broadcasts. Corpus analysis is carried out on transcriptions of recordings, with the original recordings archived for reference. The spoken part of the Cambridge Reference Corpus also includes the newly collected Spoken British National Corpus 2014, a 10 million word collection of contemporary spoken English; see Love et al. (2017a) for both a description of this corpus and a discussion of the methodology used to construct it.

The Cambridge Reference Corpus is continually developing, but the current breakdown of some key language areas within it is shown in Table 1.

There are a number of more specialised, domain-specific corpora within the Cambridge Reference Corpus. CANBEC (Cambridge and Nottingham Business English Corpus) is a collection of spoken business English recorded in companies of all sizes, from big multinational companies to small partnerships. Formal and informal meetings, presentations, and conversations on the phone, over lunch, etc. were recorded and typed into the computer for analysis by authors and editors. This helps us to find out how real people speak and use English today in a work environment, how business language really works, how to teach it better, and how to make better language learning materials for business English.

Table 1: The current key language areas in the CRC

CANCODE (Cambridge and Nottingham Corpus of Discourse in English) is a unique collection of spoken English recorded at hundreds of locations across the British Isles in a wide variety of situations: casual conversation, people socialising together, people shopping, people finding out information, discussions, and many more types of interaction. These conversations have been transcribed and annotated for analysis by authors and editors. Only spontaneous speech is found in the CANCODE corpus. This corpus has been drawn on substantially for various publications, including the Cambridge Grammar of English (Carter & McCarthy 2006), and an explanation of the corpus's design and use can be found in Carter and Adolphs (2003).

The Cambridge University Press/Cornell Corpus is a large collection of English being spoken by Americans. It includes recordings of people going about their everyday life: at work, at home with their families, going shopping, having meals and so on. The conversations have been transcribed and annotated, enabling researchers and authors to find out how American people speak and use American English today, how their language really works, and how we can produce better teaching materials for American English.

CANELC (Cambridge and Nottingham e-language Corpus) is a one-million-word corpus of digital communication in English, taken from online discussion boards, blogs, tweets, e-mails and Short Message Services (SMS). It is the output of a joint project between Cambridge University Press and the University of Nottingham; for information on how it was constructed, see Knight, Adolphs and Carter (2014).

CAMCAE (Cambridge Corpus of Academic English) has been created to support the identification of specific features of academic English, including the variances across different academic disciplines, settings, and stages of learning. This is a collaboration between Cambridge University Press and Cambridge Assessment English. By collecting academic writing from high school students through to published academic authors, and all the stages in between, the corpus provides a useful source for understanding how proficiency in academic English develops as a separate trait from general language proficiency.

BNC 2014 (The British National Corpus 2014) is a large corpus of contemporary English from a range of real-life contexts. The project was led by Lancaster University, and Cambridge University Press was a key partner in the gathering, processing and annotation of 11.5 million words of authentic spoken English. It follows on from the original BNC project in the early 1990s, and is particularly useful for examining language change in English over the past two decades. Information about the BNC 2014 can be found here: http://corpora.lancs.ac.uk/bnc2014/ and the data can be accessed through this portal: http://corpora.lancs.ac.uk/BNCweb/.

The Cambridge Reference Corpus is naturally essential for dictionary development, providing an evidence base for the definition and usage of words and senses. In the development of learning materials, the Reference Corpus is used to ensure that Cambridge courses teach the English that is most useful and authentic for students. Authors and editors check texts and activities against the Reference Corpus to make sure the language sounds natural (within the limits of the level) and reflects high-frequency occurrences of English. The corpus provides an insight into which words and phrases are becoming less frequent in contemporary usage and may therefore be given less prominence in our courses; some recent examples are ‘marvellous’, ‘marmalade’, ‘fortnight’ and ‘wardrobe’.

The Reference Corpus can also help identify patterns of usage that may not be clear to individual authors and editors. For example, the words ‘nobody’ and ‘no-one’ are considered synonyms as they have the same meaning, but the corpus shows us that ‘nobody’ is more common in spoken English and ‘no-one’ is more common in written English. Another corpus insight that intuition alone is unlikely to reveal is that ‘must’ is mostly used to express ‘obligation’ in written English, but in spoken English is commonly used for ‘deduction’ (as in ‘you must be very tired now’).
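A register comparison of this kind can be illustrated with a minimal Python sketch. It assumes two tokenised samples standing in for the spoken and written subcorpora; the sample sentences, the `tokenize` helper and the `per_million` function are invented for illustration and are not part of the Cambridge tools.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep hyphenated forms such as "no-one" as one token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def per_million(tokens, targets):
    """Return the frequency per million tokens of each target word."""
    counts = Counter(t for t in tokens if t in targets)
    total = len(tokens)
    return {word: counts[word] * 1_000_000 / total for word in targets}


# Invented toy samples; real subcorpora would contain millions of words.
spoken = "nobody told me and nobody asked so no-one knew"
written = "no-one was informed and no-one objected although nobody replied"

targets = ("nobody", "no-one")
spoken_rates = per_million(tokenize(spoken), targets)
written_rates = per_million(tokenize(written), targets)

for word in targets:
    print(word, round(spoken_rates[word]), round(written_rates[word]))
```

Normalising to a rate per million words matters because the spoken and written parts of a reference corpus are rarely the same size, so raw counts cannot be compared directly.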

3. The Cambridge Learner Corpus (CLC)

The most significant challenge for Cambridge has been developing a learner corpus, which forms the basis for research into how learners at different stages of proficiency use English. For a detailed discussion of the issues relating to learner corpora, see Granger et al. (2015). The first challenge was to obtain the texts in a form that could be associated with a reliable measure of the writers' English proficiency. By using exam scripts, the corpus is able to obtain validated measures of proficiency that are linked to the text and associated data.

The Cambridge Learner Corpus, started in 1993, is a 60-million-word (and growing) annotated corpus of learner English from Cambridge English exam scripts, taken from over 180,000 learners, from 200 different countries and with 152 different first languages. There are 12 first languages with over a million words of learner data each. The texts are coded for a number of features (e.g. part of speech, …) and are also error-coded. They are linked to metadata about the learners: their first language, country of residence, gender, age, education history and years of English study. This metadata is used to filter the corpus, effectively creating subcorpora for more focused analysis. For example, a subcorpus can be created for the English of secondary school children in Turkey at A2 level.
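Filtering on metadata to form a subcorpus can be sketched as follows. This is only an illustration under stated assumptions: the `Script` record, its field names and the `subcorpus` function are hypothetical and do not describe the actual CLC database or its software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Script:
    """A hypothetical learner script with a few metadata fields."""
    text: str
    l1: str
    country: str
    age: int
    cefr: str


def subcorpus(scripts, **criteria):
    """Keep scripts whose metadata satisfies every criterion.

    Each criterion is either a value to match exactly or a predicate function.
    """
    def matches(script):
        for field, wanted in criteria.items():
            value = getattr(script, field)
            if callable(wanted):
                if not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [s for s in scripts if matches(s)]


# Invented records for illustration only.
scripts = [
    Script("I like my school.", "Turkish", "Turkey", 14, "A2"),
    Script("I work in a bank.", "Turkish", "Turkey", 30, "A2"),
    Script("My town is small.", "Spanish", "Spain", 15, "A2"),
    Script("Last year I visited Izmir.", "Turkish", "Turkey", 16, "B1"),
]

# Secondary-school-age learners in Turkey at A2 (the age range is an assumption).
turkish_a2_teens = subcorpus(
    scripts, country="Turkey", cefr="A2", age=lambda a: 11 <= a <= 18
)
print(len(turkish_a2_teens))
```

Accepting predicates as well as exact values lets the same function express range conditions such as an age band, which simple equality matching cannot.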

The most valuable part of this is the level metadata: as these texts are part of a validated Cambridge exam, it is possible to be very confident of the writers' overall proficiency level in English at the time of writing the text. The texts are taken from eleven Cambridge English exams: Proficiency, Advanced, First, Preliminary, Key, Business Preliminary, Business Vantage, Business Advanced, BULATS (Business Language Testing System), Skills for Life and Young Learners Exams. (IELTS exam data were originally included in the Cambridge Learner Corpus, but were later withdrawn following discussions with the IELTS Consortium.) The corpus also contains data from two specialised exams that have been discontinued: International Legal English Certificate (ILEC) and International Certificate in Financial English (ICFE). This makes the Cambridge Learner Corpus almost certainly the largest learner corpus of its kind, containing reliable level data alongside detailed error-coding. For a review of learner corpora:

4. Building up the CLC

The corpus is built up every year with new data. After an analysis of where there is a need for additional data, for example where certain levels or first languages are under-represented, suitable scripts are selected from Cambridge English archives. They are keyed in as digital files, and linked to candidate information and score data. The texts are lemmatised and parsed using TreeTagger; the part-of-speech tag set used is given here: https://www.sketchengine.eu/english-treetagger-pipeline-2/. They are then coded for errors, as detailed in the next section. In addition, a ‘corrected’ version of each error is added, which is very useful for identifying certain types of error. For example, a number of errors may share the same code for a missing preposition, but it is only through analysis of the corrected versions that we can identify whether certain prepositions are omitted more frequently than others, and in what contexts.
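The value of the corrected versions can be shown with a small sketch. It assumes the errors have already been extracted as (code, learner text, corrected text) records; this record shape, the sample data and the `corrections_for` function are hypothetical, while the MT code (missing preposition) is the one used in the CLC scheme.

```python
from collections import Counter

# Hypothetical pre-parsed error records: (error code, learner text, corrected text).
# An empty learner text means that the learner left the word out.
records = [
    ("MT", "", "for"),
    ("MT", "", "to"),
    ("MT", "", "for"),
    ("RN", "work", "job"),
    ("MT", "", "at"),
]


def corrections_for(code, records):
    """Count the corrected forms supplied for one error code."""
    return Counter(corrected for c, _, corrected in records if c == code)


missing_prepositions = corrections_for("MT", records)
print(missing_prepositions.most_common())
```

The error code alone only says that a preposition is missing; counting the corrected forms is what reveals which prepositions learners tend to omit.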

5. Error-coding in the CLC

A significant challenge for the CLC project has been establishing the most effective approach to error-coding. This requires difficult decisions about the level of detail in the coding: increasing the level of detail may reduce the inter-coder reliability of tagging, while reducing the level of detail can miss data that is critical for the analysis. The error-coding system used for the CLC was developed by Dr Diane Nicholls, in conjunction with Cambridge University Press. The basic convention is as follows: <#CODE>wrong word|corrected word</#CODE>. The majority of the error codes are based on a two-letter coding system in which the first letter represents the general type of error (e.g. wrong form, omission), while the second letter identifies the word class of the required word. General types of error (the first letter) include F (wrong Form used), M (something Missing), R (word or phrase needs Replacing), U (word or phrase is Unnecessary or redundant), and D (word is wrongly Derived). The second letter represents word class, and includes codes such as C (Conjunction), N (Noun), V (Verb), etc.
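The two-letter scheme lends itself to a simple lookup, sketched below. The tables contain only the letters named in this section (P for punctuation and T for preposition are taken from the codes MP, MT and UT discussed in the following paragraphs), so they are deliberately incomplete, and the `decode` function is an illustration rather than part of the CLC tooling.

```python
# First letter: general type of error (only the types listed in the text).
ERROR_TYPES = {
    "F": "wrong form",
    "M": "missing",
    "R": "replace",
    "U": "unnecessary",
    "D": "wrongly derived",
}

# Second letter: word class of the required word (an incomplete table).
WORD_CLASSES = {
    "C": "conjunction",
    "N": "noun",
    "V": "verb",
    "P": "punctuation",
    "T": "preposition",
}


def decode(code):
    """Return (error type, word class) for a two-letter code.

    Returns None for codes outside the two-letter scheme, such as AS or CE,
    and for letters that are missing from the tables above.
    """
    if len(code) != 2:
        return None
    error_type = ERROR_TYPES.get(code[0])
    word_class = WORD_CLASSES.get(code[1])
    if error_type is None or word_class is None:
        return None
    return error_type, word_class


for code in ("RN", "MT", "MP", "AS"):
    print(code, decode(code))
```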

Punctuation errors are coded with P as the second letter and one of the error types M, R or U as the first letter, e.g. MP = punctuation Missing. Various types of Countability errors are coded, such as CD (wrong Determiner because of noun countability) or CN (countability of Noun error). There is a set of Agreement error tags (AG + word class), such as AGD (Determiner agreement error) and AGV (Verb agreement error). All ‘false friend’ errors are tagged with FF, but only where this is a documented False Friend (English words or phrases that look similar to those in another language but have a significantly different meaning or usage). Otherwise, the error is treated as a replace (R) error.

There are a number of other error tags to represent other types of error, such as AS (incorrect Argument Structure), CL (CoLlocation error), CE (Complex Error) and S (Spelling error). AS (argument structure error) covers errors in argument structures which cannot be coded as MT (missing preposition, e.g. *he explained me) or UT (unnecessary preposition, e.g. *he told to me). AS is particularly used for double object verbs, e.g. *it caused trouble to me is coded <#AS>it caused trouble to me|it caused me trouble</#AS> to circumvent the need for multiple codes to correct what is, in fact, a single error.

CE (complex error) is a catch-all code to cover multiple errors and groups of words the intended sense of which cannot be established. By using this code, we factor out of the equation strings which can yield little useful information on learner errors.

Error-coding is a completely manual process, carried out by a small number of trained and experienced coders whose work is supervised and cross-checked. As this is a considerable investment, not all content in the CLC has been error-coded yet. About half of the over 60 million words have been error-coded, with additional coding added each year. Cambridge has also worked on semi-automating the error-coding; for details, see Andersen (2011).

The cost of error-coding is weighed up against the various advantages it brings, particularly when put together with ‘corrected’ text. Because correct uses are automatically tagged NE (no error), it is easy to deselect correct uses and focus only on errors in the text. This also allows the possibility of comparing what learners get right (an often neglected area in ELT) with what they get wrong. The corpus analysis software has built-in statistical tools that easily establish the frequency, level of student or mother tongue for a given error (or correct use). It also becomes possible to search for errors of omission as well as commission. A concordanced search on ‘at’ in an uncoded corpus, for example, allows you to locate errors such as the unnecessary use of the preposition, but it does not identify instances of failure to use the preposition where it is required, or instances where ‘at’ should have been the chosen preposition but a wrong preposition was chosen instead.
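The three kinds of ‘at’ error described above can be separated once the text is coded, as the following sketch suggests. It rests on several assumptions that should be read as such: that each coded span is closed by a matching `</#CODE>` tag, that spans are not nested, that an omitted word is shown by an empty left-hand side, and that RT (replace preposition) follows from the two-letter scheme. The sample sentences are invented.

```python
import re

# Assumed span format: <#CODE>wrong|corrected</#CODE>, with no nesting.
SPAN = re.compile(r"<#(\w+)>(.*?)\|(.*?)</#\1>")


def preposition_errors(coded_text, prep):
    """Count unnecessary, missing and wrongly chosen uses of one preposition."""
    found = {"unnecessary": 0, "missing": 0, "wrong_choice": 0}
    for code, wrong, corrected in SPAN.findall(coded_text):
        wrong, corrected = wrong.strip(), corrected.strip()
        if code == "UT" and wrong == prep:
            found["unnecessary"] += 1      # learner used it where it is not needed
        elif code == "MT" and corrected == prep:
            found["missing"] += 1          # learner omitted it (error of omission)
        elif code == "RT" and corrected == prep:
            found["wrong_choice"] += 1     # learner chose a different preposition
    return found


# Invented coded sentences for illustration.
sample = (
    "We met <#RT>in|at</#RT> the station. "
    "She looked <#MT>|at</#MT> me. "
    "He went <#UT>at|</#UT> home."
)

print(preposition_errors(sample, "at"))
```

A plain string search for "at" in the uncoded learner text would find only the third sentence, since the word does not appear in the other two.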

Error-coding allows researchers to search on the error tags themselves, collecting together all errors of a particular type, comparing frequencies of occurrence in different contexts, and identifying for which L1s or at which levels those errors are most likely to occur. For more information on the advantages and uses of error-coding, see Nicholls (2003), ‘The Cambridge Learner Corpus: error coding and analysis for lexicography and ELT’.

This is an example of some error-coded text, written by a learner whose first language was Swiss German and who was taking the Cambridge English Preliminary exam.

As soon as I saw the handwriting on the envelope I smiled. It was a letter from my <#RP>penfriend|pen friend</#RP> in Germany. She's called Lisa and I write her a letter every month. Five years ago she <#TV>has moved|moved</#TV> there. We were both very sad about that, because she was my best friend. Her father <#TV>has|had</#TV> found a new <#RN>work|job</#RN> in Berlin so they had to go there.

I've already visited her twice and stayed there for one week. In this letter she said: <#RP>,|“</#RP>My parents <#TV>decided|have decided</#TV> to move back to Switzerland!” I was very happy. I began to dance and took my mobile phone to call her. We spoke <#MT>…|for</#MT> about half an hour, then I went to my table and <#AS>wrote her back|wrote back to her</#AS>.

In reports for authors and editors, the codes are usually removed, and the errors are presented in red strikethrough text, with the corrected text in green: ‘Her father has|had found a new work|job in Berlin so they had to go there.’
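Deriving the learner's original wording, the corrected wording and the report view from one coded sentence can be sketched with a regular expression. The sketch assumes that each coded span is closed by a matching `</#CODE>` tag and that spans are not nested; the function names are hypothetical and the colour formatting of real reports is not reproduced.

```python
import re

# Assumed span format: <#CODE>wrong|corrected</#CODE>, with no nesting.
SPAN = re.compile(r"<#(\w+)>(.*?)\|(.*?)</#\1>")


def learner_version(coded):
    """Keep what the learner wrote and drop the codes and corrections."""
    return SPAN.sub(lambda m: m.group(2), coded)


def corrected_version(coded):
    """Keep the corrections and drop the codes and learner errors."""
    return SPAN.sub(lambda m: m.group(3), coded)


def report_view(coded):
    """Drop the codes but keep both sides as wrong|corrected."""
    return SPAN.sub(lambda m: f"{m.group(2)}|{m.group(3)}", coded)


coded = (
    "Her father <#TV>has|had</#TV> found a new <#RN>work|job</#RN> "
    "in Berlin so they had to go there."
)

print(learner_version(coded))
print(corrected_version(coded))
print(report_view(coded))
```

Keeping a single coded source and generating each view from it means the learner text and its correction can never drift out of step.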

6. English Profile Project

Cambridge has been closely engaged with the development of the Common European Framework of Reference (CEFR), and one of the major challenges has been to make the CEFR more specific for English language teachers and learners. The English Profile Project was begun in 2007, with the aim of describing what learners of English could typically do or not do at each of the stages of the CEFR. The project was partially funded by the European Union, and was led by Cambridge University Press and Cambridge Assessment English (then called Cambridge ESOL), with contributions from the British Council, the University of Bedfordshire and English UK. More information on this project can be found here: https://englishprofile.org/

English Profile initially focused on lexis and grammar, and used the CLC as its main source of empirical data. This was supplemented by the English Profile Learner Corpus, with contributions from institutions around the world. The English Profile Learner Corpus does not have metadata about the learners' language proficiency level that is as reliable as that of the CLC, but it was able to provide more spoken learner data as well as data from tasks that are not typically included in exam contexts. A team of researchers analysed patterns in the mastery of lexis and syntax to develop ‘profiles’ of typical English language learners at each CEFR level. These profiles aimed to generalise across learners of different L1s. Naturally, the research did at times identify where certain patterns were more common among learners with certain L1s, and in theory it would be possible to create variations of the English Profile for various L1s.

The work on lexis developed into the English Vocabulary Profile (EVP), and was led by lexicographer Annette Capel. One of the important features of the EVP is that each sense of a word is treated as a separate entity, as the different senses of a word are often mastered at different CEFR levels. For example, the word ‘case’ has a number of different senses: the sense of ‘a container’ is typically mastered at A2, the sense of ‘a situation’ at B1, a ‘case’ relating to ‘crime’ at B2, and its sense as ‘an argument’ at C2. For a detailed explanation of how she undertook this task, and the issues it raised, see Capel (2010). The EVP is freely accessible via a searchable database here: https://www.englishprofile.org/wordlists.
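Treating each sense as a separate entry can be pictured as a mapping from (word, sense) to CEFR level. The sketch below uses only the four senses of ‘case’ given above; the data structure and the `senses_known_at` function are illustrative assumptions and do not describe how the EVP database is actually implemented.

```python
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Senses and levels for "case" as given in the text; other words are omitted.
SENSE_LEVELS = {
    "case": {
        "a container": "A2",
        "a situation": "B1",
        "crime": "B2",
        "an argument": "C2",
    },
}


def senses_known_at(word, level):
    """List the senses of a word typically mastered at or below a CEFR level."""
    limit = CEFR_ORDER.index(level)
    senses = SENSE_LEVELS.get(word, {})
    return [s for s, lvl in senses.items() if CEFR_ORDER.index(lvl) <= limit]


print(senses_known_at("case", "B1"))
```

A word-level list would have to assign ‘case’ a single level, which would either overstate what A2 learners know or understate what C2 learners know.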

The work on syntax developed into the English Grammar Profile (EGP), and this project was led by Anne O’Keeffe and Geraldine Mark. As with the EVP, the researchers found that the different functions or meanings of specific grammatical forms would be mastered at different CEFR levels, and so the EGP has a similar ‘polysemic’ approach to the EVP. For example, using the simple past tense to talk about events in the past is first mastered (in a very basic affirmative statement form for a limited range of verbs) by most learners at A1, but using the past tense to convey politeness (as in ‘I wondered if you were free for a moment?’) is not typically mastered by learners until B2. For a detailed description of how this project was undertaken and the key points raised by it, see O’Keeffe and Mark (2017). Information about the English Grammar Profile, and free access to it, can be found here: https://www.englishprofile.org/english-grammar-profile.

7. Using the CLC in the development of learning materials

Having developed the CLC over the past twenty years, the challenge has been to make use of it to improve the content and design of Cambridge learning materials. One of the most visible uses of the CLC has been in the development of coursebooks targeted at speakers of a specific first language, by filtering all the error analysis on the L1 metadata. In Spain particularly, a whole range of courses was developed under the banner of English for Spanish Speakers, and these drew heavily on insights from the CLC into frequent errors among learners with Spanish as their L1 (and resident in Spain). For other countries, Cambridge has adapted international coursebooks to add sections which focus on aspects of English that commonly cause problems for learners with that L1.

The two graphs below represent the breakdown of types of error made when using the simple past tense, at various levels, comparing Chinese-speaking and Arabic-speaking learners. The first column in each graph shows the average across all CEFR levels, and the other columns show how different types of error increase or decrease over each level. #TV indicates that the tense is wrong, #RV that the wrong verb has been used, #S a spelling error, #IV that a verb has been wrongly inserted, and #MV that a verb is missing. To some extent, an increase in errors can indicate that the learners are attempting a wider range of structures or vocabulary. It is common to see an increase in errors from A1 to A2 and B1. Arabic speakers generally make more errors than Chinese speakers, although Chinese speakers appear to struggle with getting the right tense even at high levels of proficiency.

Figure 1: The types of error made by Chinese-speaking learners

Figure 2: The types of error made by Arabic-speaking learners
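A breakdown of this kind can be computed from error records filtered by L1, as in the sketch below. The record shape (L1, CEFR level, error code), the sample data and the `breakdown` function are invented for illustration, and the sketch reports each error type as a share of all errors at a level, which is only one possible reading of how the figures were produced.

```python
from collections import Counter, defaultdict


def breakdown(records, l1):
    """For one L1, return the share of each error code within each CEFR level."""
    by_level = defaultdict(Counter)
    for record_l1, level, code in records:
        if record_l1 == l1:
            by_level[level][code] += 1
    return {
        level: {code: n / sum(counts.values()) for code, n in counts.items()}
        for level, counts in by_level.items()
    }


# Invented records: (first language, CEFR level, error code).
records = [
    ("Chinese", "A2", "TV"),
    ("Chinese", "A2", "TV"),
    ("Chinese", "A2", "S"),
    ("Chinese", "A2", "RV"),
    ("Chinese", "B2", "TV"),
    ("Arabic", "A2", "S"),
]

print(breakdown(records, "Chinese"))
```

Shares show how the mix of error types shifts between levels; comparing how error-prone two groups are overall would additionally require the number of words or past-tense uses at each level as a denominator.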

Another way that the CLC is used is to identify areas of English that have high error rates for learners at a particular level and age group. The analysis may concentrate on the specific areas being addressed at that stage of the writing. For example, during the development of a course for teenagers at B1, the publishing team may want data on the main difficulties that B1 teenagers are having with past tenses. The CLC provides them with that data, which may influence decisions about how to present and practise the past tense in that course.

The CLC can help authors and editors by providing texts that have genuinely been produced by learners who match the profile of those targeted by the course, including the level, age and L1 (if relevant to that course). For example, the CLC can provide a range of texts (e.g. sentences) where university students at B2 level have used lexis relating to holidays. This helps the editors ensure that the language in the activities is appropriate for those types of learner. In some courses, the authors include actual student responses from the CLC in the Student Book, so that learners can review them and correct them as appropriate. McCarten (2010) provides more examples of how corpora can be used to improve the design of language courses.

Naturally, the CLC is particularly valuable for exam preparation courses, as it provides authentic examples of the answers that students at that level produce. In a preparation course for Cambridge English First, for example, anonymised parts of texts are given which illustrate strong and weak performance in certain parts of the exam.

Another use of the CLC is in the construction of the syllabus for a course, especially for grammar and vocabulary. The content development team (a combination of authors and editors) will request corpus-based reports on, for example, the use of the past tense across A1-B2. This will guide them in deciding which aspects of the past tense are appropriate for teaching at each of the stages between A1 and B2. Where the L1 of the learners is known (i.e. the course is being written for specific countries), these syllabus decisions will also draw on the identification of ‘false friends’ or cognates between English and the L1. For a detailed discussion of designing grammars based on learner corpora, see McCarthy (2016).

The CLC is also useful for identifying where students are over-using or under-using a particular construction or lexical word/phrase. For example, in a report on Turkish learners, the under-use of ‘should’ as a modal was noted:

‘The most common error learners make with “should” is to miss it out completely. This error is particularly common with the verb “bring”:

· You|should bring some of your friends.

· You|should bring your pjs.

· You|should eat healthy food.

· I think you|should buy this phone.

Learners often use another modal where “should” is more appropriate:

· You can|should bring money and water.

· You may|should listen to my advice.

· If you want to go, you will|should phone me.

Less commonly, learners sometimes use “should” where another modal would be more appropriate:

· But on 1 December I should|have to attend a meeting in Boston.’
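The three patterns in the quoted report can be told apart mechanically from the learner form and its correction, as this sketch suggests. The pairs are taken from the examples in the report; the `classify` function and its category labels are hypothetical and are not the method used to produce the report.

```python
from collections import Counter


def classify(wrong, corrected, target="should"):
    """Classify one (learner form, corrected form) pair relative to a target modal."""
    wrong, corrected = wrong.strip().lower(), corrected.strip().lower()
    if corrected == target:
        # The target was needed: either nothing was written or another modal was.
        return "omitted" if not wrong else "other modal used instead"
    if wrong == target:
        return "target used wrongly"
    return "unrelated"


# (learner form, corrected form) pairs taken from the examples in the report.
pairs = [
    ("", "should"),
    ("", "should"),
    ("can", "should"),
    ("may", "should"),
    ("will", "should"),
    ("should", "have to"),
]

summary = Counter(classify(w, c) for w, c in pairs)
print(summary)
```

Pairs in the first two categories point to under-use of the modal and pairs in the third to over-use, which is the distinction the report draws.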

8. Future developments of Cambridge corpora

There are a number of areas and challenges where Cambridge plans to develop the Learner Corpus further in future. The most significant gap in the CLC is currently the shortage of spoken learner data, which could be based on recordings taken from Cambridge spoken exams. Such data would enable researchers to investigate how far patterns of language learning are similar or different for spoken and written English. This is related to the challenge of creating a Young Learners’ Corpus: the written texts in Cambridge Young Learners exams are very limited, and a spoken learner corpus would create a more feasible basis for researching the development of English skills among young learners.

One of the criticisms made of the Cambridge Learner Corpus is that it is limited (and therefore skewed) by the nature of tasks within Cambridge English exams. Although the range of tasks is in fact quite varied, simulating integrated communication tasks, it would be preferable to be able to include a wider range of activities from outside exams. For the English Profile Project, there was a strand of work gathering learner data from non-exam activities. However, for a large part of that data, there was no reliable measure of the learners' CEFR level at the time of completing the activity, and this weakened its usefulness. A future project which links the learners completing the non-exam tasks more clearly with recent and validated test results is needed to explore how significant the task effect is for this corpus.

One of the most difficult challenges is the error-coding, both for its cost and its reliability. Attempts to automate this process have succeeded in reducing costs and errors (see Andersen 2011), but a step change in automation would have a huge impact on the feasibility of creating more learner corpora in the future.

9. Access to Cambridge corpora…

Open CLC is a subset of the CLC that is freely available via Sketch Engine: https://www.sketchengine.eu/cambridge-learner-corpus/ It contains 2.9 million words from 10,000 student responses taken from the Cambridge English Language Assessment suite of exams (FCE, CAE and CPE) and includes data from a range of L1s. The responses come from students in more than 60 countries speaking 7 different first languages. The corpus is uncoded, i.e. it does not include error-tagging.