Icelandic Language Technology
SÍM (Consortium for Icelandic Language Technology) is a consortium of Icelandic universities, institutions, private companies and organizations that work on research and development within the programme Language Technology for Icelandic 2019-2023. SÍM works according to a contract with Almannarómur to develop core projects in language technology and submit deliverables (language resources, software and code) to Icelandic CLARIN.
Further information about the participants in the consortium can be found under the tab “About SÍM”. Anna Björk Nikulásdóttir, project manager for SÍM, answers all inquiries related to the programme and the consortium: anna@grammatek.com.
Here are all the software repositories associated with the deliverables of the programme. All software is released under open licenses.
Overview of core projects within the LT Programme
There are 6 core projects defined within the LT Programme which are to lay the foundation for the development of LT solutions for Icelandic. Here are links to all the software repositories of core projects that have been delivered so far in the programme:
Overview of Software Repositories
Language Resources
All language technology is based on language resources: texts and/or audio recordings. These resources are necessary for analyzing language, gathering vocabulary and finding rules and patterns in the language. Thus, based on language resources, it is possible to “teach” computers what matters for a specific type of software being developed, or to let software find rules and patterns in a vast quantity of data. Within the LT programme, work is being carried out on large collections of texts, which are prepared for use in LT: both monolingual Icelandic texts and bilingual parallel corpora that contain Icelandic and English texts. Speech is also being recorded in large quantities, both through crowdsourcing with Samrómur and as high-quality studio recordings for the development of speech synthesizers. Work is also being done on databases that store information on individual aspects of the language, such as vocabulary, pronunciation and meaning.
Software Repositories for language resources
MIM-GOLD, training/testing sets
ParIce: English-Icelandic parallel corpus
Spell and grammar checking
Spell and grammar checking helps with correcting text, writing correctly and even writing in an appropriate style. It also plays a crucial role in developing other LT software, where errors in text can affect automatic text processing. The goal of the LT programme is to develop a general spell and grammar checker that can find and correct the most common errors in Icelandic texts, to build knowledge of the kinds of errors that different groups of writers make, and to develop methods for adapting the system to different needs, e.g. with regard to training and teaching.
Software Repositories for Spell and grammar checking
Support Tools
Although there are a lot of specific solutions within the world of LT, there are certain types of core software that are useful in all areas of LT. These are usually hidden tools that analyze basic units in texts, from deciding what does and does not constitute a word to analyzing complex grammatical and semantic context. All of these tools, which are not ready-made software solutions per se but essential parts of language technology software and data processing, are called support tools. Examples of support tools being worked on within the LT programme are a text tokenizer, a part-of-speech (PoS) tagger and parsers.
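To give a rough idea of what "deciding what does and does not constitute a word" means in practice, here is a minimal Python sketch of a rule-based tokenizer. It is an illustration only, not the tokenizer developed in the programme: the name `tokenize` and the pattern are assumptions made for this example, and a real tokenizer for Icelandic also has to handle abbreviations, dates, numbers, URLs and similar cases.

```python
import re
from typing import List

# Assumed rule for this sketch: a token is either a run of letters/digits,
# optionally joined by internal hyphens or apostrophes, or a single
# punctuation mark. In Python 3, \w matches Unicode letters such as Þ, ð, æ.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Þetta er próf, ekki satt?"))
```

Even this small example shows the kind of decision a tokenizer makes: the comma and the question mark become separate tokens, while a hyphenated compound is kept as one word. Later tools, such as a PoS tagger or a parser, work on the token list rather than on the raw text.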
Software Repositories for Support Tools
Speech Synthesis
Speech synthesis turns written text into spoken language. The two main areas of speech synthesis software are reading and (voice) communication. Speech synthesizers are used to read text aloud, for example from websites or even whole books. People who cannot read for some reason, or who have difficulty reading, rely on speech synthesizer technology in their daily lives. Communication systems, where speech recognition detects what a user says, require speech synthesizers in order to respond with a voice. Within the LT programme, emphasis is placed on developing new speech synthesizer voices for Icelandic, e.g. so that users can choose a voice they find pleasant to listen to.
Software Repositories for Speech Synthesis
Speech Recognition
Speech recognition turns spoken language into written language. It is a prerequisite for communicating with computers and devices in the way that is most natural for the majority of people: by talking. The aim of the LT programme is to create a general-purpose speech recognizer for Icelandic that is accessible through a web service. All methods and data will also be available as a foundation for the development of specialized speech recognizers.
Software Repositories for Speech Recognition
Machine Translation
Machine translation (MT) is automatic translation between languages. It has already become useful for various language pairs, both in helping people figure out the subject of texts in a language they cannot read and in accelerating the work of translators in languages in which they are experts. However, no translation software as of yet can deliver translations that are close to a satisfactory level of quality; texts always need to be reviewed and corrected if the translation needs to be accurate. The aim of the LT programme is to create an open MT system capable of translating between Icelandic and English. It should be useful for translating texts in specific domains so that translators can complete texts faster.