Michael
•
7th May 2020

In our last blog post, we talked about machine translation and CAT tools, both of which play a significant role in the way the modern translation industry works. Today, we are going to look behind the scenes of machine translation applications and programmes. As we mentioned in the previous post, MT applications use databases called ‘corpora’ or ‘corpuses’.

Corpora play a key role in machine translation. As a rule of thumb, the better your corpora, the better the results you’ll obtain. One of the reasons why different MT tools translate the same text in different ways is that they use different corpora. (There are also other reasons, but that’s a subject for a separate article.)
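To make this concrete, here is a deliberately tiny sketch in Python. It is not how a real MT engine works: the two ‘corpora’ are invented word pairs, and the function name is our own. It only shows that a tool which picks the most frequent translation it has seen will give different answers depending on the corpus behind it. The Polish word ‘zamek’ can mean both ‘castle’ and ‘lock’.

```python
from collections import Counter

# Hypothetical toy "corpora": (Polish word, English translation) pairs,
# invented purely for illustration.
TRAVEL_CORPUS = [("zamek", "castle"), ("zamek", "castle"), ("zamek", "lock")]
HARDWARE_CORPUS = [("zamek", "lock"), ("zamek", "lock"), ("zamek", "castle")]


def most_likely_translation(corpus, word):
    """Return the translation seen most often for `word`, or None if unseen."""
    counts = Counter(target for source, target in corpus if source == word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(most_likely_translation(TRAVEL_CORPUS, "zamek"))    # castle
print(most_likely_translation(HARDWARE_CORPUS, "zamek"))  # lock
```

The same word and the same rule produce two different results, simply because the underlying data differs.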

Polish Corpora: databases for machine translations

Let’s take a closer look at Polish language corpora. A linguistic corpus is a dataset of representative words, sentences and phrases in a given language. Usually, a corpus consists of text from books, magazines, newspapers and internet portals published in that language (in our case, Polish). Sometimes, it also contains more colloquial phrases, for instance those originating in chats and other online communication.

Corpora are sometimes called authentic language materials because they show what a language really looks like – the kind of words and phrases that are in constant use in everyday life. Corpora are also essential in machine translation: they provide the source data that machine-learning algorithms draw on to translate a given text or other piece of content.
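For readers who like to see things in code, the idea of ‘words in constant use’ can be sketched as a simple frequency count. The example below is a minimal illustration in Python, assuming a made-up mini-corpus of three sentences; real corpora hold millions of documents and use far more sophisticated tooling. The names DOCUMENTS, tokenize and word_frequencies are our own.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: three invented Polish sentences.
DOCUMENTS = [
    "Ala ma kota.",
    "Kot ma Alę.",
    "Ala lubi kota i psa.",
]


def tokenize(text):
    """Lower-case the text and return its words (runs of Unicode letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(documents):
    """Count how often each word form appears across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    return counts


frequencies = word_frequencies(DOCUMENTS)
for word, count in frequencies.most_common():
    print(word, count)
```

Note that this naive count treats ‘kot’ and ‘kota’, or ‘Ala’ and ‘Alę’, as different words. Polish is a highly inflected language, which is one reason why serious corpus tools analyse inflection rather than just matching strings.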

There are at least 20 different Polish language corpora available on the market, but we will look at three of the most significant and extensive ones.

National Corpus of Polish

This is the largest and most important Polish language corpus. It consists of almost 2,700 books and more than 340 newspapers and magazines and contains a whopping 1.5 billion words.[1] In the words of the NCP website:

‘The National Corpus of Polish is a shared initiative of four institutions: Institute of Computer Science at the Polish Academy of Sciences (coordinator), Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź. It has been carried out as a research-development project of the Ministry of Science and Higher Education.

These four institutions have started cooperation to build a reference corpus of Polish language containing over fifteen hundred millions of words. The corpus is searchable by means of advanced tools that analyse Polish inflection and the Polish sentence structure.’

Forty-three linguistic developers are continually working on this project!

Polish Wikipedia Corpus

Also worth mentioning is the linguistic corpus that consists of all Polish Wikipedia entries.[2] This is made up of 895,486 articles with around 169 million words – that’s over 1GB of text in total! It is important to note that it contains only ordinary articles: there are no stubs, templates, disambiguation pages, revision histories, etc., and all multimedia, tables, references, links and other non-plain-text elements have been removed.

The newest available version of this corpus was published in 2013, which makes it a bit outdated.

PWN Corpus

This is the corpus published by the PWN Group (Polish Scientific Publishers PWN). According to the PWN website,[3] its main corpus contains 70 million words in current use. There is also a larger, complete corpus that includes press archives and literature dating back to the Middle Ages. This contains 100 million words and comprises books, magazines, leaflets, advertisements and websites.

Without these three corpora, Polish machine translation would never have come into existence.

Other corpora available in the Polish language include:

Polish Parliamentary Corpus
Polish Summaries Corpus
Polish Coreference Corpus
National Photocorpus of Polish
PICLE Corpus, the Polish sub-corpus of the International Corpus of Learner English (ICLE).

While we are on the subject of machine translation, remember that we offer a post-editing service. If you have an MT-translated text that you need to ‘Polish’ up, so it’s 100% correct and natural, drop us a line: we’ll be more than happy to help you!

[1] http://nkjp.pl

[2] http://clip.ipipan.waw.pl/PolishWikipediaCorpus

[3] https://sjp.pwn.pl/korpus
