-rali.iro.umontreal.ca/arc-a2/BAF/ The BAF Corpus is a corpus of French-English bi-texts, i.e. pairs of French and English texts that are mutual translations and whose sentences have been aligned. The corpus was built by the CITI computer-assisted translation group (TAO). Most of the texts are of an institutional genre (Canadian Hansard, UN reports, etc.), but a few scientific papers and a literary work are also included. The whole corpus has about 400,000 words for each language. BAF Version 1.1 is already available and can be freely downloaded in UNIX GZ format, in ZIP format, and as separate files in TXT and CES formats. A description, the alignment conventions, encoding documentation, and a COAL Tools suite are also freely available on the site. [2001 April 23].
The University of Maryland Parallel Corpus Project (MPCP) is acquiring and annotating texts in order to create multilingual corpora for linguistic research, particularly computational linguistics. Religious texts such as the Bible are widely available, carefully translated, and appear in a huge variety of languages. The MPCP provides versions of the Bible consistently annotated according to the CES. Some papers related to this project, mainly by Philip Resnik, are also freely downloadable in PS format. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. [2001 May 1].
mirroring sites at Antwerp (Belgium) and Chukyo (Japan). The CHILDES system provides tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. The CHILDES database (cf. this link) is a large group of children's spoken and written language corpora, all freely available for PC or Mac. It includes a vast amount of transcript data collected from children and adults who are learning languages. All of the data are transcribed in the CHAT format, which makes them easy to analyze with the CLAN programs. A 2.5 MB PDF manual of the CHILDES corpora is freely available (at this address), as are the CLAN concordancer for accessing the data and its manual. CHILDES corpora cover 23 European and non-European languages: Cantonese, Catalan, Danish, Dutch, Estonian, French, German, Greek, Hebrew, Hungarian, Irish, Italian, Japanese, Mambila [Bantu], Mandarin, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish, Welsh. The bulk of the collection is, however, English (see under the English section). There is also a remarkable Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish), made from narratives elicited using Mercer Mayer's "frog story" picture book. All materials are freely available directly from the site; the texts are also downloadable. Contact: CHILDES Project, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, USA; e-mail to brian@andrew.cmu.edu.
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan). The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material. ECI/MCI has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora with from two to nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). A complete list of the contents is available following this link. Unusually cheap: the ECI/MCI is available directly from ECI at a price of 95 DFl (for payments made by credit card or Eurocheque); 110 DFl (for payments by bank transfer); or 120 DFl (for payments by cheques other than Eurocheques). You need only sign a license agreement, available (PostScript or LaTeX version) at this address or this other one. It is also available for $35 (or through membership) from the LDC on a CD-ROM in High Sierra format (ISO 9660), readable on UNIX, MS-DOS and Apple systems at least: cf. this page.
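Because the ECI/MCI entry above mentions three different 8-bit character sets (ISO 8859-1, -5 and -7), text from each subcorpus has to be decoded with the matching codec before it can be processed as Unicode. The following is a minimal Python sketch of that step; the function name `decode_corpus_bytes` and the sample byte strings are hypothetical illustrations and are not taken from the ECI/MCI distribution or its documentation.

```python
# Minimal sketch: decode 8-bit corpus text from the ISO 8859 family.
# The helper name and the sample bytes are assumptions made for illustration.

def decode_corpus_bytes(raw: bytes, part: int) -> str:
    """Decode `raw` using ISO 8859-<part>; only parts 1, 5 and 7 are accepted,
    matching the character sets named in the catalog entry."""
    if part not in (1, 5, 7):
        raise ValueError(f"unsupported ISO 8859 part: {part}")
    # Python ships codecs named iso8859_1, iso8859_5 and iso8859_7.
    return raw.decode(f"iso8859_{part}")


if __name__ == "__main__":
    # Hypothetical sample words: French (Latin-1), Russian (Cyrillic), Greek.
    print(decode_corpus_bytes(b"fran\xe7ais", 1))
    print(decode_corpus_bytes(b"\xe0\xe3\xe1\xe1\xda\xd8\xd9", 5))
    print(decode_corpus_bytes(b"\xe5\xeb\xeb\xe7\xed\xe9\xea\xdc", 7))
```

The same byte values mean different letters in each part of the standard, so choosing the wrong part silently yields wrong text rather than an error; which part applies to which subcorpus has to be taken from the corpus documentation.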
The EMILLE Project is a 3-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. The project will establish a language engineering (LE) architecture within which minority LE may take place. EMILLE will extend GATE (the General Architecture for Text Engineering) to be fully Unicode compliant so that it may act as a framework within which the corpora of EMILLE can be both developed and exploited. GATE will be extended at Sheffield, in close liaison with the Lancaster team, to meet the needs of EMILLE. GATE was first released in 1996 and has since had a wide take-up in language processing laboratories around the world (Cunningham, Gaizauskas, Humphreys, and Wilks, 1999). EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu. These are the Indic languages indicated as being those most wanted by the LE community in the Baker & McEnery (1999) survey. For those languages with a UK community large enough to sustain spoken corpus collection (Bengali, Gujarati, Hindi, Panjabi and Urdu), EMILLE will also produce spoken corpora of at least 500,000 words per language. The written corpus data will contain at least 200,000 words of parallel text. The remainder will be monolingual corpus data. The monolingual written corpus will attempt to shadow the composition of the BNC (British National Corpus) in terms of genres as far as possible. The spoken corpus data will be gathered from communities across the UK on minidiscs. The digitised sound wave of the minidiscs will be stored and released as part of the final project deliverables. Note that the use of digital media to collect the data will ease the transfer of the data to computer. 
The data will also be transcribed. EMILLE will publish the corpora on a web site for downloading, one of the favoured distribution formats reported by the Baker et al. (1998) review of corpus validation. The Department of Linguistics at Lancaster University has undertaken to maintain the web site beyond the life of the EMILLE project. ELRA (an EMILLE partner) has agreed to organise distribution of the project resources on CD. The corpus will be accompanied by a handbook, analogous to the BNC user reference guide, which will give details such as the sources from which individual corpus texts were gathered. [2001 June 17].
The European Language Newspaper Text corpus is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked up using SGML. The text is taken from the following sources: [1] - Approximately 60 million words of text in French and German have been made available from the Associated Press (AP) World Stream. AP World Stream is a compilation of AP news reports produced in 86 bureaus in 68 countries. The Associated Press Worldstream newswire service provides articles in six languages, interleaved on a single data stream. The data is collected via an Associated Press installed telephone line at the LDC. [2] - Approximately 110 million words of text in French, German and Portuguese have been made available from Agence France Presse. Each language was supplied in a separate data stream collected via a Dateno MKII satellite receiver and associated equipment at the LDC. [3] - Approximately 20 million words of text in German have been made available from Deutsche Presse Agentur. The text is collected via an AP Datafeatures telephone line installed at the Linguistic Data Consortium. [4] - A smaller part of the corpus comes from Le Monde newspaper. The Le Monde data covers about 65 million words of French. It is quite distinct from the AP and AFP materials in its markup approach, because it has been prepared in compliance with the conventions of the Text Encoding Initiative (TEI), rather than having been based on the model of the TIPSTER collections, which were originally developed prior to the establishment of the TEI conventions. Available only through LDC membership, cf. this link.
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics, and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches. There are several different versions.
+ Hansard Treebank (The Canadian Hansard Treebank). A skeleton-parsed parallel corpus (English-French) of proceedings in the Canadian Parliament. 750,000 words.
+ Hansard LDC Parallel Corpus (The LDC Canadian Hansard Treebank). The collection presented here has been assembled by the LDC by way of archives from two distinct secondary sources. Material from one time period of parliamentary proceedings was acquired through the IBM T. J. Watson Research Center, while material from another period was acquired through Bell Communications Research Inc. (Bellcore). The combined collection covers a time span from the mid-1970s through 1988, with no apparent duplication between the two data sources. Aside from covering different time periods, the two archives have different organization and have undergone different amounts and kinds of processing in being prepared as a parallel language resource. In addition, the Bellcore set itself comprises two distinct types of data: one appears to be the main parliamentary proceedings (similar in nature to the IBM set), while the other consists of transcripts from committee hearings. The three sets have been kept distinct in this publication and each is described in greater detail in separate documentation files on the CD-ROM. Available only from the LDC, through membership or for $5000, cf. this link.
+ TransSearch Hansard (texts 1986-1993). 
In this free, untagged online version, developed by RALI, you can specify a word or an expression in English or in French: TransSearch will look for contexts where the expression appears and show you the corresponding context in the other language.
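The kind of lookup described for TransSearch (find an expression on one side of an aligned bitext and show the corresponding text on the other side) can be sketched in a few lines of Python. This is only an illustration of the general idea of a bilingual concordance over sentence-aligned pairs, not TransSearch's actual implementation; the function name `bilingual_concordance` and the two sample sentence pairs are hypothetical.

```python
# Minimal sketch of a bilingual concordance over sentence-aligned text.
# Names and sample data are assumptions; this is not how TransSearch is built.
from typing import List, Tuple

AlignedPair = Tuple[str, str]  # (English sentence, French sentence)


def bilingual_concordance(
    pairs: List[AlignedPair], query: str, lang: str = "en"
) -> List[AlignedPair]:
    """Return every aligned pair whose `lang` side contains `query`
    (case-insensitive substring match)."""
    if lang not in ("en", "fr"):
        raise ValueError("lang must be 'en' or 'fr'")
    side = 0 if lang == "en" else 1
    needle = query.lower()
    return [pair for pair in pairs if needle in pair[side].lower()]


if __name__ == "__main__":
    # Hypothetical aligned sentences, invented for this example.
    sample = [
        ("The House met at 2 p.m.", "La Chambre se réunit à 14 heures."),
        ("I thank the honourable member.", "Je remercie l'honorable député."),
    ]
    for en, fr in bilingual_concordance(sample, "chambre", lang="fr"):
        print(en, "|", fr)
```

A plain substring match is the simplest possible choice here; a real concordancer would normally tokenize the text and index it so that large corpora can be searched quickly.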