An interview with Signe Oksefjell Ebeling
2019-03-03 Signe Oksefjell Ebeling, Xiuling Xu
(XLX: Xiuling Xu; SOE: Signe Oksefjell Ebeling)
XLX: Could you briefly talk about the University of Oslo's connection with LOB and ICAME in the 1970s?
SOE: This is well before my time. The University of Oslo's connection with LOB and ICAME is very much down to Stig Johansson. Some of this is discussed in Leech and Johansson's article “The coming of ICAME” (ICAME Journal No. 33, 2009). As I've understood it, Geoffrey Leech and his team at Lancaster in the UK were building the Lancaster Corpus, as they called it, and they had received some funding, but they encountered tremendous problems not only with funding in the end but also with copyright. So by the mid- to late 1970s, around 1977, I think, Leech had more or less given up. Around this time Stig spent a year at Lancaster and developed a keen interest in corpus linguistics. In 1977, Stig attended a computing for the humanities course in Bergen where he met and befriended Knut Hofland. I think Stig wrote to Leech and suggested that maybe he could help to complete the Lancaster Corpus here in Oslo with support from Knut in Bergen, hence the name the Lancaster-Oslo/Bergen Corpus. This is also connected to the whole “coming of ICAME”, because since they ran into severe problems with copyright, they thought maybe they could try to impress the publishers with an international organization with this computer archive, and the story goes that ICAME was founded at Stig's kitchen table in 1977.
XLX: The next question is about corpus research here in later periods. The University of Oslo has a long-standing tradition of corpus-based contrastive studies dating back to the early 1990s. How did Prof. Stig Johansson come up with the idea of building a parallel corpus (i.e. the English-Norwegian Parallel Corpus) at that time? Because before that time, it seems that he did research on the English language only, in other words monolingual research.
SOE: It's a very good question and I'm not sure whether I know why exactly at this time. But Stig, ever since the mid-1970s at least, had been interested in contrastive issues. He'd been interested in error analysis, language learning and teaching. He had a publication entitled Papers in Contrastive Linguistics and Language Testing in 1975. Ideas about the benefits of contrastive analysis, both in an applied and a more theoretical perspective, had been with him for a long time. In addition, I think the compilation of a parallel corpus had much to do with technical developments and knowing the right people. I don't know to what extent Stig discussed this with Gale and Church, who had an article in 1991 about alignment. It's possible that they met people at conferences and discussed these matters, so the necessary technology was sort of starting to be available. Again, Stig approached his friend and colleague Knut Hofland, who had the know-how to deal with alignment and other technical matters. Other people who had an impact were Bengt Altenberg and Karin Aijmer in particular. So, in that sense, I think, for Stig, the time felt right to do this kind of thing. Yet another thing that might come into it is that he ran a project here at the University that was called English in Norway, with a focus on Anglicisms and English influence on the Norwegian language, so a corpus like this could also be useful for research in this field.
XLX: One more question I want to ask is about the relationship between the ENPC and Mona Baker's Translational English Corpus. In the article “Contrastive linguistics in a new key”, Ebeling (2016: 9) mentioned that in July 1993, Mona Baker wrote to Stig Johansson saying that John Sinclair had shown her a copy of Stig's proposal for an English-Norwegian corpus, and that she hoped to be setting up her own corpus of translated texts soon at The University of Manchester Institute of Science and Technology (UMIST). Do you think Baker's TEC was to some extent inspired by the ENPC?
SOE: I'm not sure actually. I don't know how far in her thinking she had got before she got in touch with Stig. I think perhaps that those two initiatives were two parallel things going on at the same time. It's very often the case. So to say that the TEC was definitely inspired by the ENPC, I'm not sure.
XLX: I think both of them are quite pioneering. One is a bidirectional parallel corpus and the other a monolingual translational corpus.
SOE: Yeah, and they are based on more or less similar thoughts coming up at the same time in a way.
XLX: You were a core member of the ENPC team. Could you tell me how you got involved in the project? Was that your first contact with corpus linguistics?
SOE: Let me answer the second question first. This was not my first contact with corpus linguistics. I came as a student to Oslo in the early 1990s, and already in my undergraduate days, we were introduced to corpora, and we knew there was an allocated room with one computer in the middle of the room where you could search the LOB and Brown corpora. So I used LOB, Brown and the Kolhapur Corpus for my master's thesis in 1994. And Stig Johansson was my MA supervisor.
And how did I get involved in the ENPC project? Well, I was done with my master's studies, and the research assistant job on the project became vacant. I applied for it, and got it.
XLX: Parallel corpora were rather new in the early 1990s. Were there any challenges that the team encountered in the compilation of the ENPC?
SOE: You said parallel corpora were rather new at that time. In fact, the ENPC is what we count as the first parallel corpus of its kind. We encountered a lot of challenges in compiling the ENPC. One sad thing is that some of the challenges are still with us today. In particular, the challenge that we heard about for the LOB Corpus, namely copyright, is not getting better, particularly for English texts. It's fair to say that it's easier in Norway these days to get copyright clearance, but for a parallel corpus you need copyright clearance for more than one language, often in countries with different copyright laws. From the very start we had the challenge of (sentence) alignment, for instance. That has now been reasonably well solved. Technical matters were solved along the way, and with technological developments, Unicode for instance, all of that has become easier. We also had the challenge of getting funding. The ENPC wasn't all that well funded; in fact, I don't think we got much funding beyond the Department and Faculty. More substantial funding only came a bit later with the Nordic Project.
XLX: How about the challenge of the selection of materials? There were more texts translated from English into Norwegian than the other way around.
SOE: In the selection of material to be included you obviously need publications that have been translated between the two languages. We also had to leave out a couple of publications, because they didn't have any kind of punctuation, for instance, and it would be difficult to tackle them with sentence alignment. The corpus comprises both a fiction part and a non-fiction part, but the non-fiction part is very fragmented, due to the fact that few non-fiction texts are translated from Norwegian into English. Because of that, the selection of non-fiction texts is not very robust. I remember Bengt Altenberg and myself sat down with the non-fiction texts to try to categorize them according to the Dewey Decimal Classification system and we only got one text per slot, really. There was nothing much we could do about that; this has been one of the criticisms of the ENPC, the fact that the non-fiction part is very heterogeneous.
XLX: It is still a big problem for parallel corpora.
SOE: Yes, and maybe more and more so. I know that there is an ongoing project in Sweden. Magnus Levin and his colleagues at Linnaeus University gave a talk at the ICAME39 conference on a translation corpus of only non-fiction texts, English-Swedish-German. They have encountered similar problems, but they keep adding texts, so hopefully this will become a really good resource for translation studies and contrastive analysis.
XLX: Yeah, actually I also came across this problem when I built the English-Chinese parallel corpus, because few journal articles are translated from English into Chinese nowadays. Many Chinese can read English articles and they can even publish in English, so they don't need the translation.
SOE: Exactly, which means that it will be a challenge to compile parallel corpora also in the future.
XLX: What about the proportion of fiction and non-fiction texts? How did you make the decision?
SOE: I think the idea was 50-50, but we ended up with 60% fiction and 40% non-fiction. There are 30 original fiction texts and 20 non-fiction texts in each language. The fiction part was easier to compile and I think, had it not been for copyright, it would be easier to do the same today with fiction, because Norwegian fiction has gained ground over the last few years and it's translated more and more.
XLX: What are the strengths and limitations of the ENPC model?
SOE: It's bidirectional, which means it's aligned to and from both languages. So that's a strength. This means that we believe that we have a strong tertium comparationis; we take translations to indicate that meaning and function are reproduced in the translation and we can check for that due to the bidirectional model. And it makes it possible to discover sets of crosslinguistic correspondences in material like this, compared to comparable corpora where you can't really be sure that what you identify in the two languages really matches. So it goes back to the tertium comparationis again. With the translation paradigms going in both directions, this is what we get. And again, the fact that the corpus is bidirectional makes it possible to check for what has been called translationese or translation effects. For example, if you suspect that something is unidiomatic English in the translated text, you can check that against original English texts, and at least be aware of the fact that this may have to do with characteristics of translation rather than the nature of English in general. So these are the strengths. The limitations we've touched on, such as those that have to do with the selection of texts that you actually have, because not all kinds of texts are translated. And also the balance of the text types between the two languages is hard to maintain in a corpus like this, and probably even worse, for example, between Chinese and English or between Chinese and other Germanic languages. So those are clearly limitations. But I still think the strengths outweigh the limitations.
XLX: The design and compilation of the ENPC were carried out in close cooperation with sister projects in Sweden and Finland. Could you tell me more about those collaborations?
SOE: I think, first of all, that Stig saw the opportunity not only for funding but also for a larger project that would gain interest across a number of languages. I don't know if he had typological studies in mind, but that may have come into it as well. He had good friends in academia, particularly in Sweden, and also Kari Sajavaara in Finland. They were really interested in this project, so collaboration was easy in the sense that these people knew each other and had similar interests. The project resulted in the English-Swedish Parallel Corpus and the ENPC: both have been used in a number of contrastive studies. The Finnish Corpus is unidirectional (En-Fi) and to my knowledge it has not been widely used for research.
XLX: Yeah, and these corpora used many identical English original texts.
SOE: Yes, this was the idea: to have a pool of English texts that all three corpora could use.
XLX: So that was the idea from the beginning.
SOE: Yeah, once the Nordic Project got started. But I think the ENPC was a bit ahead of the others, so they followed suit; even so there is not a full overlap of English texts. The ESPC, for example, contains some English original texts that we don't have and the other way around.
XLX: How was the ENPC extended to the Oslo Multilingual Corpus?
SOE: This was towards the end of the 1990s. Particularly colleagues from the German Department (we used to be separate departments back then) got interested in what we'd done with the English-Norwegian Parallel Corpus. In particular Cathrine Fabricius-Hansen, professor of German, teamed up with Stig to include German. First of all, they wanted to do a trilingual multidirectional corpus with English, Norwegian and German based on the ENPC, but in the meantime we had also, for smaller projects, collected translational texts of English and Dutch and English and Portuguese. So some languages had already been added, but the OMC only materialized when they got funding for the project Languages in Contrast (Språk i Kontrast, SPRIK). So, you could say that the ENPC generated interest from professors of other languages, mainly German, but also French (with Hans Petter Helland and Marianne Hobæk Haff, both professors of French at Oslo); one of the sub-corpora in the OMC is the French-Norwegian Parallel Corpus (FNPC). There was also some collaboration with the French department at the University of Bergen, and they contributed some of the French-Norwegian texts to the FNPC.
XLX: In addition to parallel corpora, the corpus research team at the University of Oslo have also got involved in several learner corpus projects, for example the International Corpus of Learner English (ICLE), the Varieties of English for Specific Purposes dAtabase (VESPA) and the Idiomaticity Project. Could you say a few words about them?
SOE: Stig got involved in Learner Corpus Research in the late 1990s. He thought the idea put forward by Sylviane Granger with the Integrated Contrastive Model really was a good idea, and he wanted to take part in the ICLE initiative to build learner corpora containing texts produced by learners of English with different L1 backgrounds. So he and a student at that time called Lynell Chvala collected the Norwegian ICLE (i.e. NICLE). And later on, Hilde Hasselgård and myself joined the team in Louvain to build the Norwegian VESPA.
XLX: When was that?
SOE: In 2008 or 2009. With the increased awareness and interest in genre/register differences in language, we wanted to set up a corpus of L2 English student writing in different university disciplines to match the British Academic Written English (BAWE) Corpus, which I had previously worked on in the UK. So my experience from working on that corpus project came in handy when we were setting up the VESPA project, although BAWE is not really a resource for learner language research. It's about university disciplines and university disciplinary genres, that is, novice writers in the UK, mainly L1 English speakers, but not only. The criterion for a paper to be included in the BAWE was that it had received a good mark.
XLX: So it's a different research focus. That was student writing, whether L1 or L2.
SOE: Exactly. And the overarching project had to do with university genres or disciplinary genres. So, the VESPA Corpus, with its L2 writing in the disciplines, can easily be compared with data from the BAWE. And the Idiomaticity Project that you mentioned really links up with all of this, both contrastive and Learner Corpus Research. The Integrated Contrastive Model lies at the core of that. In the Idiomaticity Project, then, we typically use the corpora that we have built ourselves (ENPC, ICLE, VESPA) and also draw on some others for comparison.
XLX: In terms of the recent developments in corpus-based contrastive linguistics, could you talk about the new International Comparable Corpus, i.e. the newly launched international collaborative project you are involved in?
SOE: The International Comparable Corpus or ICC [ɪk] or ICC [ˌaɪ siː ˈsiː]. We are still debating how we are going to pronounce that. This project was initiated by Anna Čermáková and John Kirk last year (2017), and they invited colleagues and people they know in the corpus linguistics world to take part in this project, where each national team is supposed to collect their national component for the comparable corpus. The idea is that we should reuse as much material as we can from other corpora to facilitate the whole compilation process. So myself and Jarle Ebeling are in charge of the Norwegian part of the ICC, trying to collect or put together a Norwegian component of this particular corpus. The other collaborators so far are people based in the Czech Republic, Slovakia, Poland, Finland, Sweden, Great Britain and Germany. I think nine languages altogether so far, including French. And the idea also is that the whole design of the ICC is supposed to follow the ICE (i.e. International Corpus of English). In terms of design it should contain 60% spoken language and 40% written language representing different text types. We had a kick-off meeting in Prague last year where we discussed how to go about the whole thing, and it turns out that it's hard to get hold of all these “old” texts and incorporate them into a new corpus. It has to do with copyright again, and it also has to do with suitability and comparability of what has already been collected. For Norwegian, for instance, we have very little spoken material that has already been collected that is suitable for this corpus. The plan is to have collected the written part by mid-2019, as this part turns out to be easier to compile. We'll have a poster presentation at a conference in Louvain in September this year (2018) and also have a workshop there to discuss matters to do with the compilation. I think it's a great initiative, but there are challenges. And also the corpus will only contain 1 million words, which may turn out to be very small for contrastive comparisons in some cases. So we'll see what can come out of it, but I think it's worth a try.
XLX: Yeah, it'll facilitate contrastive studies between different languages.
SOE: And particularly the fact that you'll get a comparable corpus of spoken language. This will be the main contribution of the ICC, I think.
XLX: How do you view your own role in corpus research at the University of Oslo and your contribution to corpus linguistics beyond UiO and Norway?
SOE: Well, as for my own role at this university, I'd say I teach, facilitate and do research. I'm involved in introducing students to corpora from a very early stage, so even in their first year, the students can be introduced to corpus techniques. The courses I'll be teaching this term to second-year students and master's students are about corpus linguistics, to give them the know-how to carry out corpus research, but I also remind them that at the core of everything is language. We're interested in language, we're not necessarily interested in the corpus as such, because without ourselves and without linguistic knowledge, corpora are “nothing”; they are just texts. In terms of my contribution beyond the University and Norway, I'm co-editor, with Hilde Hasselgård, of the international journal for contrastive linguistics Languages in Contrast. I'm also a member of the ICAME board, and I collaborate with quite a few people, not only on the ICC. There's a great network of corpus linguists out there!
XLX: The last question is: do you have some advice for young scholars who wish to do corpus research, corpus-based contrastive linguistics in particular?
SOE: I think in a way that goes back to what I just said. Yes, corpora are very good. They give you empirical objective data, but remember not to lose track of what we're really using them for, which is linguistic research. First of all, know your corpus. And know your corpus tool and what it can do for you or what you can use it for in order to find out something about language. So that's probably my main piece of advice. And obviously go beyond number crunching. Also, in recent years, the focus is more and more on statistics in corpus research. Young scholars should probably take this into account more than I have done so far, and maybe from an early stage find a collaborator who knows statistics. Especially for corpus-based contrastive linguistics, I always advise my students to try to use already-existing resources, because, as you know, it takes a lot of time to build your own corpus, and also to use standard, existing tools. This goes not only for contrastive studies, of course. But the thing about contrastive linguistics is that we need to take into account more than one language, which is twice the amount of work sometimes. And very often one language is better described than the other, typically English, so how can you give insights into both languages and not only one? So it is important to keep that in mind when you carry out contrastive research. And I think the method in contrastive linguistics is essential, because some people who are doing contrastive research are perhaps doing research that is closer to translation studies than contrastive linguistics. It's a fine line, but to be aware of these slight differences is important.
XLX: Very good suggestions. Do you have some advice on tertium comparationis?
SOE: I think people should actually pay more attention to the tertium comparationis than is typically done. Because I've done a few contrastive studies based on comparable data, and it is harder to argue that I'm comparing like with like than when you use a bidirectional translation corpus. So I think this is essential and that's also part of the method. But I know there are quite a few people who criticise the use of translation for contrastive studies, as translation is seen as the “third code”, that is, translated English, for example, is seen as being fundamentally different from the language originally produced in English. So very often if you use a corpus like the ENPC you have to argue that this is a good thing. I think precisely the presence of a stronger tertium comparationis is a good argument, although people aren't always convinced. So it's a matter of showing that translation can be used in this way, in a systematic and sound way.
XLX: Thank you very much for your time.
Corpora
British Academic Written English (BAWE)
https://www.coventry.ac.uk/research/research-directories/current-projects/2015/britishacademic-written-english-corpus-bawe/
English-Norwegian Parallel Corpus (ENPC)
https://www.hf.uio.no/ilos/english/services/knowledge-resources/omc/enpc/
English-Swedish Parallel Corpus (ESPC)
https://www.sol.lu.se/engelska/corpus/corpus/espc.html
International Corpus of Learner English (ICLE)
https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html
Lancaster-Oslo/Bergen Corpus (LOB)
http://clu.uni.no/icame/manuals/LOB/INDEX.HTM
Varieties of English for Specific Purposes dAtabase (VESPA) learner corpus
https://uclouvain.be/en/research-institutes/ilc/cecl/vespa.html
The Norwegian component of VESPA
https://www.hf.uio.no/ilos/english/services/knowledge-resources/vespa/