Listing 1 - 10 of 36
This report provides a new state of affairs with regard to language technology for Dutch, a language with approximately 25 million speakers. Language technology for Dutch is highly developed, and the importance and status of the language are confirmed by other measures, such as the strongly growing number of online sales in Dutch and the growing presence of Dutch online.
Academic collection --- Conferences - Meetings --- Computational linguistics --- Netherlands --- Congresses
This thesis investigates the possibility of predicting punctuation and segmentation in the language use of people with an intellectual disability, with a view to possible implementation in Text2Picto, a program that automatically converts typed text into pictographs or, conversely, pictographs into typed text. This software makes it easier for people with an intellectual disability to communicate via the internet. Because the language use of these people is atypical and contains far more errors than that of people without an intellectual disability, applying automatic punctuation and segmentation prediction is not straightforward. This thesis searches for the best-scoring method within the domain of machine translation, because this approach yielded the best results in previous research on automatic punctuation and/or segmentation prediction. Punctuation marks and segmentation annotations are added to the input by framing the prediction process as a monolingual translation: Dutch text with little or no punctuation and segmentation annotation is translated into text in the same language with punctuation and segmentation annotations. Six models were compared for this purpose: statistical machine translation and neural machine translation models, each trained on three different datasets: the SoNaR new media corpus, subtitles from SoNaR500, and a combination of both. After training the models on these datasets, they were evaluated on an evaluation set of unseen data, which was translated by the models and then scored automatically using the evaluation metrics precision, recall and F-score.
This evaluation showed that the neural machine translation model trained on only the SoNaR new media corpus performed best for predicting both punctuation and segmentation.
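The evaluation metrics mentioned above can be computed by aligning the model output with a punctuated reference. The following is a minimal sketch under the assumption that hypothesis and reference are token-aligned; the punctuation set and function name are illustrative, not taken from the thesis.

```python
# Hypothetical sketch: scoring predicted punctuation against a reference.
# Assumes `predicted` and `reference` are aligned token lists.
def punctuation_scores(predicted, reference):
    """Compute precision, recall and F-score over punctuation tokens."""
    PUNCT = {".", ",", "?", "!"}
    tp = sum(1 for p, r in zip(predicted, reference) if p in PUNCT and p == r)
    fp = sum(1 for p, r in zip(predicted, reference) if p in PUNCT and p != r)
    fn = sum(1 for p, r in zip(predicted, reference) if r in PUNCT and p != r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

For example, a hypothesis that gets one of two punctuation marks right scores precision, recall and F-score of 0.5 each.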
In this paper, the technique of domain adaptation is applied to improve the statistical machine translation of a particular baseline engine. More specifically, cross-entropy and class-based data selection are used to select the most appropriate data from an out-of-domain (OOD) corpus given an in-domain (ID) corpus. Both methods consist of selecting a list of the k OOD sentences ranked according to how similar they are to the ID corpus. Successively smaller sets of the k best lines, with k halved each time, will hypothetically match the ID data increasingly well, each time discarding the 'noise' in the OOD data that lies furthest from the ID corpus. These different-sized sets are analysed to see whether a threshold, or optimal amount of k sentences, can be observed. In the MT engine, the selected 'pseudo in-domain' data is combined with the ID data in the tuning step, using log-linear interpolation or a back-off model. Finally, the translation evaluation scores are compared to see which method yields the greatest improvement over the original baseline MT engine. This technique is applied on the one hand to adapt the domain of an OOD corpus (Europarl v7) to a large legal ID corpus (Belgian Official Gazette) and on the other hand to adapt the domain of an OOD corpus (Digital Corpus of the European Parliament) to an ID corpus with a similar (European) background and domain (Acquis Communautaire), both to test whether the technique still holds, in theory, under these conditions.
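The ranking step described above can be sketched as follows. This is an illustrative, simplified version of cross-entropy difference selection (in the style of Moore and Lewis) using unigram language models with add-one smoothing; the paper's actual models and corpora are not reproduced here, and all function names are assumptions.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Build a smoothed unigram language model from a list of sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(sentence, lm):
    """Per-word cross-entropy (bits) of a sentence under a unigram LM."""
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / len(words)

def select_pseudo_in_domain(ood, id_corpus, k):
    """Rank OOD sentences by H_ID(s) - H_OOD(s); lower means more ID-like."""
    id_lm, ood_lm = unigram_lm(id_corpus), unigram_lm(ood)
    ranked = sorted(ood, key=lambda s: cross_entropy(s, id_lm)
                                       - cross_entropy(s, ood_lm))
    return ranked[:k]
```

Halving k repeatedly, as in the paper, simply means calling the selection with k, k/2, k/4, and so on, yielding nested, increasingly ID-like subsets.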
This thesis addresses the task of detecting language boundaries in the code-switched "BeCoS" corpus, a collection of videos of official press conferences from the Belgian Federal Government discussing the COVID-19 pandemic in Dutch, French, and German, accompanied by live sign language interpretation in either Flemish or Francophone sign language. However, the lack of transcripts for the spoken content limits its usability for various applications. To enable transcription for ASR, accurately detecting language boundaries becomes essential. Existing ASR systems assume a single language for an entire audio file, rendering them unsuitable for code-switched content. As a result, a pre-processing step is required to segment the audio into parts containing individual languages. This segmentation relies on language boundary detection, which marks the points in time where language switches occur. This thesis addresses language boundary detection as a multi-class classification problem, vital for providing accurate transcripts for code-switched ASR. The proposed Whisper-based approach offers promising results, laying the groundwork for future research to enhance code-switched ASR systems.
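Once a classifier has assigned a language label to each fixed-length audio window, the boundaries follow from the points where the label changes. The sketch below shows only that final step; the window length and label names are assumptions for illustration, not details from the thesis.

```python
# Hypothetical post-processing: turn per-window language labels (e.g. from a
# Whisper-based classifier) into monolingual segments with time stamps.
def boundaries_from_labels(labels, window_sec=2.0):
    """Return (start_sec, end_sec, language) segments from a label sequence."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        # A segment closes at the end of the sequence or at a label change.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * window_sec, i * window_sec, labels[start]))
            start = i
    return segments
```

Each resulting segment can then be passed to a monolingual ASR system, which is exactly the pre-processing role the thesis assigns to language boundary detection.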
In recent years, increasing attention has been paid to the potential of using pictographs to open up the online world, so that users with intellectual disabilities can benefit from the same tools for remote communication (email, instant messaging, social media) which define so much of what it means to be a socially active member of society nowadays. This thesis describes the development of an automatic translation system that aims to enable language-impaired, intellectually disabled individuals to compose written messages simply by selecting a sequence of pictographic images. By way of contrast with existing approaches for pictograph-to-text translation, the system that we develop here, Depicto, takes a 100% rule-based approach. That is, all stages in the translation process make use of linguistic rules, as opposed to statistical data. Such an approach makes it possible to encode elegant generalizations about the pictographic input, which has advantages for the expressivity and consistency of translation. Rule-based approaches are generally also costly, however. Thus, aside from the obvious objective of testing whether this approach can actually be realized, we also explore how development can be made more feasible. In addition, we set two design criteria: first, the system must be sensitive to the needs of its users; second, it must be possible to extend to other target languages. (Currently, the system translates to Dutch.) In Chapter 2, we introduce all third-party resources used by the system. In Chapter 3, we show how the pictographic symbol set Sclera can be analysed as a natural language and how this language can be modelled by a constraint-based grammar. This grammar is written in an implemented variant of the HPSG framework. In the first half of Chapter 4, we see how the semantic structure of analysed pictographic sequences is translated so as to be compatible with the input expected by the target language grammar. 
This happens in the second module in the Depicto chain. In the second half of Chapter 4, we describe a basic grammar model of Dutch and show how this is used to generate well-formed sentences based on this translated semantic structure. Next, in the first half of Chapter 5, we evaluate the system as a whole. We argue that both its precision, i.e., ability to produce well-formed output, and performance are high, but are forced to concede that its coverage is limited. We also show how the Depicto system fares when pitted against a fundamentally statistical translation system. All in all, we conclude that Depicto is able to translate (a subset of) pictographic sequences into well-formed natural language sentences. By adopting an explicitly modular design, we try to make the system as appealing as possible to developers dealing with other target languages. At the same time, extending Depicto's analysis module remains a costly task. Whether such costliness is an adequate trade-off for the high quality of the system's output will be determined by future work. As a translation system, Depicto succeeds. However, as an assistive writing tool, it currently falls short. Its analysis module imposes rather stringent word order constraints on its input, and its output is an unfiltered list of all possible translation hypotheses. In future work, we will focus on minimizing the first limitation and removing the second altogether. This will most likely involve exciting experiments in hybridization.
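To make the rule-based idea concrete: at its simplest, such a system combines a lexicon lookup with explicit grammar rules, rather than learned statistics. The toy below is a drastically simplified assumption for illustration; Depicto itself uses a full constraint-based (HPSG-style) grammar, not a lookup table.

```python
# Toy illustration of rule-based pictograph-to-text translation.
# The lexicon and the single SVO ordering rule are hypothetical.
LEXICON = {"ik": ("ik", "SUBJ"),
           "drinken": ("drink", "VERB"),
           "koffie": ("koffie", "OBJ")}

def translate(pictographs):
    """Map pictograph names to Dutch words, then apply an SVO ordering rule."""
    tagged = [LEXICON[p] for p in pictographs]
    order = {"SUBJ": 0, "VERB": 1, "OBJ": 2}
    words = [w for w, tag in sorted(tagged, key=lambda t: order[t[1]])]
    return " ".join(words) + "."
```

Even this toy shows the appeal the thesis describes: the ordering rule is a single, inspectable generalization, whereas a statistical system would have to learn it from data.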
Automatic speech recognition systems generate an unpunctuated stream of words as output. This text is difficult to read and understand, and it degrades the performance of downstream machine tasks such as text mining, document classification and speech translation. This study presents different approaches to solving this problem: machine translation techniques and language-modelling approaches. We compare these techniques and discuss their advantages and disadvantages.
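The language-modelling family of approaches mentioned above treats punctuation as a "hidden event" between words. The sketch below is a deliberately minimal assumption-laden version: it counts how often each word is followed by a punctuation mark in punctuated training text and reinserts marks wherever that probability is high. Real systems use n-gram or neural models; the corpus and threshold here are illustrative.

```python
from collections import Counter

def train(punctuated_sentences):
    """Count, per word, how often each punctuation mark follows it."""
    after = Counter()   # (word, punct) pairs observed in training
    total = Counter()   # occurrences of each word
    for sent in punctuated_sentences:
        tokens = sent.split()
        for w, nxt in zip(tokens, tokens[1:] + ["</s>"]):
            total[w] += 1
            if nxt in {".", ","}:
                after[(w, nxt)] += 1
    return after, total

def restore(words, model, threshold=0.5):
    """Insert a punctuation mark after a word when training data favours it."""
    after, total = model
    out = []
    for w in words:
        out.append(w)
        for punct in (".", ","):
            if total[w] and after[(w, punct)] / total[w] > threshold:
                out.append(punct)
                break
    return " ".join(out)
```

The machine translation approaches the study compares instead treat the same task as translating unpunctuated text into punctuated text, as in the monolingual-translation thesis listed above.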