A Review of Uncharted: Big Data as a Lens on Human Culture
2015-03-29 SHAO Bin & CHEN Jingjing
(Zhejiang University of Finance and Economics, Hangzhou 310018)
Aiden, Erez & Jean-Baptiste Michel. 2013. Uncharted: Big Data as a Lens on Human Culture. New York: Riverhead Books. ISBN: 978-1594487453. pp. 288.
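In the era of big data, research that uses quantitative analysis of massive datasets to reveal trends in the evolution of human culture is known as “culturomics”. The concept originates from the paper “Quantitative analysis of culture using millions of digitized books”, published in Science in 2011 by the Harvard research team of J.-B. Michel and E. L. Aiden. Aiden and Michel subsequently collaborated again and in 2013 published Uncharted: Big Data as a Lens on Human Culture①, which describes culturomics research and its applications in detail. Culturomics has brought the natural sciences and the humanities together and has contributed to the emergence of the new field of “digital humanities”. This review gives a brief account of the book in the hope of drawing scholarly attention to culturomics and of helping readers grasp new trends in humanities research in the era of big data.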
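1. Summary of Contents
The book has seven chapters. Chapter 1 gives a general definition of culturomics: the quantitative study of human culture by means of big data. The authors argue that big data will change the humanities, transform the social sciences, and redefine the relationship between the world inside the ivory tower and the world outside it (8). The research is based on the Google Books corpus, which contains the digitized text of 30 million books published since the 16th century in seven languages (English, French, German, Spanish, Russian, Chinese and Hebrew), amounting to 500 billion words and 6% of all the books ever published. Because its texts span five centuries, the corpus can reflect changes in patterns of human behavior, shifts in culture, and even the rise and fall of civilizations; it is therefore not only “big data” but also “long data”. Copyright restrictions, however, prevent researchers from working directly with the content of the books. The authors therefore developed the Google Books N-gram② Viewer (hereafter N-gram Viewer), which plots the year-by-year change in the usage frequency of words in the corpus as a curve. It thus works like a prism through which the evolution of human culture can be observed.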
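To make the idea behind the N-gram Viewer concrete, the following is a minimal Python sketch of how word sequences of length 1 to 5 (the range given in note ②) might be extracted from a text and turned into per-year relative frequencies. The toy corpus, the whitespace tokenizer and the function names are illustrative assumptions; this is not the pipeline actually used for Google Books.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=5):
    """Return every contiguous word sequence of length 1..max_n."""
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def yearly_relative_frequency(corpus_by_year, phrase):
    """Map each year to (occurrences of phrase) / (number of n-grams of the same length)."""
    n = len(phrase.split())
    result = {}
    for year, text in sorted(corpus_by_year.items()):
        tokens = text.split()  # naive whitespace tokenizer (assumption)
        same_length = [g for g in extract_ngrams(tokens, max_n=n) if len(g.split()) == n]
        counts = Counter(same_length)
        result[year] = counts[phrase] / len(same_length) if same_length else 0.0
    return result


# Hypothetical two-year toy corpus, for illustration only.
toy_corpus = {
    1900: "the United States of America is large",
    1950: "the United States and the United Kingdom",
}
print(yearly_relative_frequency(toy_corpus, "United States"))
```

Plotting the resulting year-to-frequency mapping as a line chart gives the kind of curve the viewer displays; only aggregated counts are needed, not the copyrighted full text.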
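CHEN Jingjing is a graduate student at Zhejiang University of Finance and Economics. Main research interests: corpus linguistics and discourse analysis. Email: stephaniecjj@163.com
* This paper is an interim result of the National Social Science Fund of China general project “A Study on Constructing a Model of Language Evolution Based on Emergent Affixes in English and Chinese” (No. 14BYY001).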
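Chapter 2 uses the corpus to explore grammatical change. Language is a part of culture that is relatively easy to delimit, so the book begins with language change. The case study is the regularization of the past tense of irregular verbs, and the focus is the relationship between a verb's usage frequency and its regularization. The study finds that of the 177 irregular verbs of Old English, 145 remained in Middle English and only 98 remain in Modern English. The authors calculate that the half-life③ of an irregular verb is proportional to the square root of its usage frequency; equivalently, its rate of regularization is inversely proportional to that square root. If verb A is 1/100 as frequent as verb B, A regularizes 10 times as fast as B. They further calculate that verbs such as chide and shrive, with frequencies between 10⁻⁶ and 10⁻⁵, have a half-life of about 300 years, whereas verbs such as drink and speak, with frequencies between 10⁻³ and 10⁻², have a half-life of about 5,400 years. The chapter concludes that the data speak for themselves: human language also undergoes natural selection, and usage frequency is the most important factor determining whether an English irregular verb survives (43).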
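The frequency relationship in this chapter can be illustrated with a short calculation. The sketch below assumes that half-life grows with the square root of usage frequency, which is what the chapter's worked example implies (a verb 100 times rarer regularizes 10 times as fast), and it assumes simple exponential decay when converting a half-life into a surviving fraction. The function names are illustrative and do not come from the book.

```python
import math


def relative_regularization_speed(freq_a, freq_b):
    """How many times faster verb A regularizes than verb B,
    assuming half-life is proportional to sqrt(usage frequency)."""
    return math.sqrt(freq_b / freq_a)


def surviving_fraction(years, half_life):
    """Fraction of verbs expected to stay irregular after `years`,
    assuming simple exponential decay with the given half-life."""
    return 0.5 ** (years / half_life)


# A verb 100 times rarer than another regularizes about 10 times as fast.
print(relative_regularization_speed(1e-6, 1e-4))

# Half-lives quoted in the chapter: about 300 years (rare verbs), about 5400 years (frequent verbs).
for half_life in (300, 5400):
    print(half_life, round(surviving_fraction(1000, half_life), 3))
```

Under these assumptions, most rare irregular verbs would be regularized within a millennium, while most frequent ones would survive it.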
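Chapter 3 uses the corpus to explore the “blind spots” of lexicography, that is, words not recorded in dictionaries. First, the study finds that most English dictionaries record only high-frequency words, while the low-frequency words that make up 52% of the total vocabulary never enter a dictionary; these constitute the “lexical dark matter”④ of the lexicon. The authors therefore regard the English vocabulary as still, to some extent, an “undiscovered continent” (76). Second, the counts show that around 1900 the English vocabulary already exceeded 550,000 words; by 1950 it had grown only to 600,000, but by 2000 it had reached 1,000,000, and it now gains about 8,400 words a year, so growth is accelerating. In addition, the study finds that although lexicographers strive to track new words, dictionaries still cannot reflect the latest changes in English vocabulary in time. Take the fourth edition of The American Heritage Dictionary, published in 2000: its new entries include mesclun, netiquette and amplidyne, yet the N-gram Viewer shows that mesclun and netiquette had already reached the frequency required for inclusion by 1992, while amplidyne peaked as early as 1950 and was already an obsolete word by 2000. The N-gram Viewer can thus locate the “rise and fall” of words, help keep dictionaries up to date, and explore the “uncharted” territory of the lexicon.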
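The claim of accelerating growth can be checked with simple arithmetic on the three vocabulary sizes reported in this chapter. The sketch below computes the average number of words added per year between consecutive milestones; the variable and function names are illustrative.

```python
# Vocabulary sizes reported in the chapter (approximate word counts).
vocabulary = {1900: 550_000, 1950: 600_000, 2000: 1_000_000}


def average_yearly_growth(sizes):
    """Average number of words added per year between consecutive milestones."""
    years = sorted(sizes)
    return {
        (start, end): (sizes[end] - sizes[start]) / (end - start)
        for start, end in zip(years, years[1:])
    }


growth = average_yearly_growth(vocabulary)
for (start, end), per_year in growth.items():
    print(f"{start}-{end}: about {per_year:.0f} new words per year")
```

The averages are about 1,000 words per year for 1900-1950 and about 8,000 per year for 1950-2000, which is of the same order as the current figure of roughly 8,400 new words a year.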
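Chapter 4 uses the N-gram Viewer to measure fame. If a person's fame is taken to be the frequency with which his or her name appears in Google Books, fame becomes computable. Overall, the frequency curves of personal names in Google Books share a common shape with four stages: debut, rapid growth, peak, and slow decline. The authors measure fame in five respects: (1) age at debut; (2) the time it takes fame to double; (3) age at peak fame; (4) the half-life of fame; (5) the relationship between fame and occupation. They find that the age at which fame peaks is generally stable at about 75, whereas the other measures change over time. Taking 1800 and 1950 as earlier and later reference points, the age at debut fell from 43 to 29, the doubling time shrank from 8.1 years to 3.3 years, and the half-life of fame dropped from 120 years to 71 years. In short, people in modern times become famous earlier and faster, but are also forgotten faster. The relationship between fame and occupation also yields striking findings. According to the data, actors typically become famous at around 30, the earliest of all; writers at around 40, eventually achieving greater and more lasting fame; politicians at around 50, late but with the greatest fame of all; scientists at around 60; artists and mathematicians are the least likely to become famous. The N-gram Viewer thus makes fame, a seemingly subjective matter, quantifiable and objectively measurable. In fact, Veres and Bohannon (2011) had already ranked more than 4,000 scientists by fame using quantitative methods and published “The science hall of fame” in Science; this chapter can be seen as an extension of that article.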
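Several of the fame measures listed in this chapter can be read off a name's frequency curve mechanically. The following is a minimal Python sketch that finds the peak year, the half-life after the peak and the doubling time from a year-to-frequency mapping. The data are invented for illustration, and the exact definitions used here (first year at or below half the peak, first year at or above twice the starting value) are simplifying assumptions rather than the authors' procedure.

```python
def peak_year(series):
    """Year in which the name's frequency is highest."""
    return max(series, key=series.get)


def half_life_after_peak(series):
    """Years from the peak until frequency first drops to half the peak value or below.
    Returns None if that never happens within the data."""
    peak = peak_year(series)
    threshold = series[peak] / 2
    for year in sorted(y for y in series if y > peak):
        if series[year] <= threshold:
            return year - peak
    return None


def doubling_time(series, start_year):
    """Years from start_year until frequency first reaches twice its start_year value.
    Returns None if that never happens within the data."""
    target = 2 * series[start_year]
    for year in sorted(y for y in series if y > start_year):
        if series[year] >= target:
            return year - start_year
    return None


# Hypothetical frequency curve (occurrences per million words), for illustration only.
fame = {1900: 1.0, 1910: 2.5, 1920: 6.0, 1930: 8.0,
        1940: 7.0, 1950: 5.0, 1960: 3.9, 1970: 3.0}
print(peak_year(fame), half_life_after_peak(fame), doubling_time(fame, 1900))
```

With real data sampled yearly, combining the peak year with a person's birth year gives the age at peak fame, and the same idea applies to the age at debut.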
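Chapter 5 shows how the N-gram Viewer can be used to trace censorship and political suppression. If certain words or names suddenly “vanish” from the corpus during some period, it is very likely because they were banned from books. The authors examine censorship and political suppression in Nazi Germany by comparing the German and English Google Books corpora. Google Books shows that the Jewish painter Marc Chagall began to gain fame around 1910. In English books his fame continued to rise, but in German books it fell to a trough between 1936 and 1944, evidently because the Nazi persecution of Jews had “silenced” him. Some episodes of political suppression in history were large in scale and affected many people, and although the victims were “blacklisted”, they were not necessarily put on record; examples include the Great Purge in the Soviet Union under Stalin and the political censorship in the “Hollywood Ten” case in the United States. By examining changes in the frequency of words or names with the N-gram Viewer, however, one can automatically detect whether a person or an idea was subjected to censorship or suppression.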
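The cross-language comparison described in this chapter can be sketched as a simple ratio: the mean frequency of a name inside a suspect period divided by its mean frequency outside that period, computed separately for each language. The measure, the name `suppression_ratio` and the numbers below are hypothetical illustrations and should not be read as the authors' actual index or data.

```python
from statistics import mean


def suppression_ratio(series, start, end):
    """Mean frequency inside [start, end] divided by mean frequency outside it.
    Values well below 1 suggest the name was unusually rare in that period."""
    inside = [f for y, f in series.items() if start <= y <= end]
    outside = [f for y, f in series.items() if y < start or y > end]
    return mean(inside) / mean(outside)


# Hypothetical frequencies (per million words) of one name in two language corpora.
english = {1930: 2.0, 1934: 2.0, 1938: 2.0, 1942: 2.0, 1946: 2.0, 1950: 2.0}
german = {1930: 2.0, 1934: 2.0, 1938: 0.5, 1942: 0.5, 1946: 2.0, 1950: 2.0}

for label, series in (("English", english), ("German", german)):
    print(label, suppression_ratio(series, 1936, 1944))
```

A ratio near 1 in one language alongside a much lower ratio in another is the kind of contrast the Chagall example describes; deciding whether such a gap is meaningful would still require a comparison against many other names.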
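Chapter 6 uses big data to study collective memory and collective forgetting. The authors point out that concepts such as collective memory have usually been excluded from scientific inquiry, yet studying them with the N-gram Viewer is not difficult (153). They take year numbers as an example, tracking the frequency of a given year's number to observe how the events of that year are remembered. The study shows that a given year is forgotten quickly at first and more slowly later, in line with the Ebbinghaus forgetting curve. As society develops, however, people forget faster and faster and soon lose interest in the past. For example, the half-life of the year number 1872 is 24 years, whereas that of 1973 is only 10 years. The authors also examine the counterpart of collective forgetting, “collective learning”, that is, how new things enter the “collective consciousness”. Taking 147 patented inventions listed in Wikipedia as examples, they observe how new things come to be accepted by the public. The statistics show that in the early 19th century an advanced technology needed about 65 years to be accepted by mainstream culture, whereas by the early 20th century it needed only 26, so new things are being accepted faster and faster.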
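Chapter 7 extends the scope of big data. The authors argue that, as far as big data is concerned, Google Books is only the tip of the iceberg. In the future, newspapers, manuscripts and even physical objects will be digitized, giving rise to a digital humanities on a grand scale. For example, the 422 surviving letters of the American writer Edgar Allan Poe reveal his creative process, and the old belongings in his former home reflect the environment in which he wrote, yet such material is not yet included in Google Books. Once digitized, these resources will join the Google Books project to form a prism that reflects cultural change and refracts every aspect of human history. Big data can not only record the past and examine the present but also predict the future. The authors therefore close with the conclusion that “data is power”.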
2. Evaluation
The book's strengths lie mainly in the following two respects.
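First, it bridges science and the humanities and advances the development of “digital humanities”. Several years ago, Professor Gary King of Harvard University predicted that with the advent and use of big data, the empirical basis of social science research as a whole would change significantly, and that the convergence of qualitative and quantitative research might even accelerate (King 2009). Using quantitative analysis, the book explores grammatical change, lexicography, the measurement of fame, censorship and suppression, and collective forgetting and memory, all important topics in the humanities and social sciences. Traditionally these areas have been regarded as hard to study quantitatively, but the book presents them with reasonable objectivity by means of a vast database. Culturomics can be said to offer the humanities an entirely new research method and to have advanced digital humanities as a discipline. In just two or three years, scholars abroad have adopted a culturomics perspective in research on sentiment mining, conflict prediction, changes in university rankings, climate change, the measurement of complex relationships and other areas, producing no fewer than a hundred papers, which shows how great and far-reaching the book's influence has been.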
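Second, it is reader-friendly and written in accessible language. The book brings big data into humanities research and proposes the concept of culturomics, yet it is not laden with technical jargon. Instead, it sets out the application of big data in humanities research in plain language and with a popularizing intent, and it does not intimidate the reader. Each chapter poses one research question and gives a detailed account of the relevant theoretical background and related knowledge, often using stories and analogies to help readers understand the question. Because the authors are scholars from the natural sciences, they occasionally borrow concepts from those fields, such as “dark matter”, “half-life” and “genome”, but these are aptly used and clearly explained. The book's readership is therefore not limited to specialists in linguistics; general readers interested in culture can also benefit from it.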
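The book also has two shortcomings.
First, research based on the N-gram Viewer is detached from context and sometimes overgeneralizes. The N-gram Viewer relies heavily on word frequency analysis and cannot examine the context in which a word occurs. In discussing fame, for example, the frequency of a name in Google Books can measure only how famous a person is, not whether that fame is good or bad. Moreover, although using word frequency alone as a proxy for cultural influence is easy to operationalize, it is hard to judge from a curve alone whether a change is statistically significant. If statistical methods were used to process the N-gram Viewer data further, the research could go deeper. One example is Acerbi et al. (2013), who combined an affective lexicon (WordNet Affect) with the Porter stemming algorithm (Porter’s Algorithm) to study changes in the expression of emotion in 20th-century English Google Books.
Second, the data for some languages are insufficient, so the corpus is not representative enough. The English portion of the Google Books corpus is huge, at 350 billion words, but the Chinese portion has only 13 billion words, far too few given the vast body of books in Chinese. In other words, the Chinese corpus is not sufficiently representative, which inevitably affects the conclusions. For example, a search of the Chinese Google Books corpus for the names “孔子” (Confucius) and “孟子” (Mencius) shows that the former rarely appears before 1800 and the latter is first mentioned as late as 1927, a result that plainly contradicts the facts. The cause is that the Chinese data are not balanced across historical periods.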
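Despite these shortcomings, the book's merits outweigh its flaws. As the first work to expound the concept of culturomics systematically and to introduce its applications, it is bound to leave a strong mark on the history of big data. In fact, in the past few years humanities scholars in China have begun to pay attention to digital humanities, for example Zhang Longxi (张隆溪 2011), and some have even used a culturomics perspective to trace the development of sociology over the past century, for example Chen Yunsong (陈云松 2015). On the whole, however, such research has yet to get properly under way in China. This review therefore hopes to draw scholarly attention to culturomics and looks forward to more scholars engaging in big data research to explore the “uncharted” territory of the humanities and social sciences.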
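Notes
① Hereafter, citations of this work give page numbers only.
② An N-gram (usually rendered in Chinese as “N元组”) is a sequence of one or more words extracted from a corpus, that is, a single word or a phrase. In this research N ranges from 1 to 5; in other words, N-grams run from 1-grams to 5-grams, so that “America”, “United States” and “the United States of America” are all included. The 2 billion N-grams of Google Books can be searched and downloaded at https://books.google.com/ngrams/.
③ Both authors, Aiden and Michel, were trained in science and engineering and therefore often borrow terms from the natural sciences in their discussion. Half-life originally refers to the time required for half of the nuclei of a radioactive element to decay; here it denotes “the time required for frequency to fall by half”.
④ The authors set the frequency threshold at one occurrence per billion words in Google Books, that is, 10⁻⁹; words below this value count as low-frequency words.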
References
Acerbi, A., V. Lampos, P. Garnett & A. Bentley. 2013. The expression of emotions in 20th century books [J]. PLoS ONE 3: 1-6.
King, G. 2009. The changing evidence base of social science research [A]. In G. King, K. Schlozman & N. Nie. The Future of Political Science: 100 Perspectives [C]. New York: Routledge. 91-93.
Michel, J.-B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, T. G. B. Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak & E. L. Aiden. 2011. Quantitative analysis of culture using millions of digitized books [J]. Science 331(6014): 176-82.
Veres, A. & J. Bohannon. 2011. The science hall of fame [J]. Science 331(6014): 143.
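Chen, Yunsong [陈云松]. 2015. A century of sociology in big data: A study of cultural influence based on millions of books [大数据中的百年社会学——基于百万书籍的文化影响力研究] [J]. Sociological Studies [社会学研究] (01): 23-48.
Zhang, Longxi [张隆溪]. 2011. Humanities research and electronic information technology [人文研究与电子信息技术] [J]. Shuwu [书屋] (10): 52-54.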
(Editor in charge: XUAN Yan)
Abstracts of Major Papers in This Issue
English Education: Needs and Mission, by YE Xingguo, p.1
This speech begins with an exploration of the evolution of English education in China, probes into the relationship between English education and the state’s needs, analyses the significant contribution of English education to the realization of state strategies, and concludes that a university shall, at the juncture of the promulgation of the National Criteria of Teaching Quality for Bachelor Degree Foreign Language Programs, find a “niche” or specific state needs and aim at satisfying them by working out its own criteria of English teaching quality.
On Collaborative Innovation of Translation Education under the New Normal, by YE Xingguo, p.5
The speech sets forth the new normal of the translation circles, analyses the different value orientations of the subjects of the collaborative innovation, namely the relevant circles of administration, enterprises, education, research and clients, and points out the six main problems and their solutions.
Innovation of English Teaching under the New Normal, by YE Xingguo, p.9
The speaker talks about how the new domestic needs, international situation and ICT development are challenging English teaching, why new ideas, standards and methods shall be applied and what the new normal of English teaching is, and emphasizes the importance of keeping up with the times and teaching innovation.
An Overview of Linguistic Landscape Study in China and the Prospect, by ZHANG Baicheng, p.14
Linguistic landscape study in China dates back to the 1980s. In the past forty years, Chinese scholars have achieved remarkable progress in this domain, and the numerous studies mainly cover three themes: (1) Linguistic landscape translation and the norms; (2) Features of domain-specific linguistic landscapes; (3) Theory and methodology in linguistic landscape study. The studies investigate many types of linguistic landscapes including public signs/labels, publicizing language, slogans, street/road/store/institutional names, and couplets. The limitations of the studies lie in the four aspects: emphasizing description but ignoring interpretation, inadequacy of theoretical and methodological explorations, and not paying enough attention to multimodal signs per se. Future study can be furthered through focusing on five aspects, including shifting the research focus, exploring the theoretical and methodological issues and so on.
On the Nature of Middle Verbs and Middle Constructions, by YANG Yongzhong, p.19
Middle constructions are a well-studied topic in linguistics. Based on a summary of the properties and features of middle verbs, this paper proposes that middle constructions are composed of two verbs, of which the first verb, serving as the predicate, denotes an action characteristic of conventional property or features, while the second verb, serving as a complement clause, denotes result. The combination of the two verbs denotes a complete event. Based on this, it is argued that all middle verbs must be of this nature in terms of underlying structure. Once this has been accepted, many long-standing puzzles related to middle constructions are solved quite readily.
A General Review of Dynamic Assessment and Second Language Learning, by WANG Hua, p.25
In dynamic assessment, important information about a learner’s abilities and changes can be learned during the assessment. Dynamic assessment is a procedure for simultaneously assessing and promoting development of learners’ cognitive procedure and ability, which is confirmed and applied in the research on second language teaching and learning. This study is a brief review and comment on the research of dynamic assessment and its application in second language learning based on a wide retrieval of literature.
Innovation or “Old Wine in a New Glass”: On the Use of Neologism in Skinner’s Verbal Behavior, by JIANG Daohua, p.31
Verbal Behavior, from the functional perspective, analyzes the cause-and-effect relations of human verbal behavior, and the key to understanding its theoretical framework lies in the use of neologistic terms. Taking this as the point of departure, the paper discusses the misunderstandings of Skinner’s behavioral theory initiated by Noam Chomsky and points out the great innovation and insightfulness of Skinner’s work.
A Study on English Majors’ Pragmatic Awareness in English Gratitude Context: Sex Roles and Social Situations, by CAI Chen & WANG Yinyin, p.46
In this article, we found that more and more males and females show androgynous characteristics and that masculinity has a higher sensitivity on pragmatic awareness than femininity. Participants show different pragmatic awareness on social situations and demonstrate significant difference on the perception of the burden of kindness. Meanwhile, sex roles still have different perceptions on the same social situation. The results reveal that participants construct their pragmatic awareness in the communication process. A successful communication requires the participants to improve their sensitivity on the differences of sex roles and social situations, so the intercultural communication teaching shall concentrate on cultivating students’ critical inter-cultural communicative competence.
Role of Information Grounding in Literary Translation for Discourse Structuring: A Study of Three English Translations of a Chinese Prose “Zuiwengting Ji”, by LI Ming, p.60
Any discourse, a conglomeration of different sentences, features background information as well as foreground information both at the clause level and at the discourse level. The information which knits the thread of a discourse and which moves the discourse forward is called foreground information and the information which does not immediately and crucially contribute to the speaker’s goal, but which merely assists, amplifies, or comments on it is background information. Information grounding theory holds that an acceptable discourse results from the modulation of both foreground information and background information. The present paper, by taking three translations of the first paragraph of the Chinese literary discourse “Zuiwengting Ji” as an instance and through extracting and back-translating into Chinese their respective foreground information, aims to make readers fully aware of the important role that foreground information plays in achieving global coherence in discourse structuring.
On Mental Access between Topic and Subject in Text Translation, by ZHONG Shuneng, YOU Liping & ZHANG Yunxia, p.65
It is revealed that a topic finds its way in a Chinese text by means of an NP, a pronoun or a zero-form and works as a subject, on the one hand. On the other hand, a topic is embodied in a corresponding English counterpart in the form of either an NP or a pronoun and functions as a subject. The present paper indicates that a topic chain is established by means of metonymy when a topic accesses itself to a series of subjects. It is concluded by claiming that the topic chain plays a crucial role in developing a naturally coherent text.
Exploration on the Compilation Method of Special Dictionaries Based on the English-Chinese Parallel Corpus, by ZHANG Yushuang & GUAN Xinchao, p.69
This article describes the compilation method of special dictionaries based on the English-Chinese parallel corpus. In comparison with the traditional compilation method, the corpus-based method can improve a dictionary’s systematization and standardization, whatever its size is. How to choose corpus texts and how to do word frequency statistics etc. are the keys to compilation. The meanings of general words in this kind of dictionary will contribute to learning functions and are useful for understanding of specialties. The determination of true special terms depends upon the compilation goal, the dictionary users etc. The corpus-based dictionary can also provide a linking service for special dictionaries. Certainly, there are disadvantages to compiling the dictionary in this way, and they should be treated carefully during the compilation process.
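About the author: SHAO Bin is an associate professor at Zhejiang University of Finance and Economics. Main research interests: corpus linguistics, lexical semantics and cognitive linguistics. Email: seesky1978@163.com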