Dissertation
Master thesis: Generating Topic Models from Corpora Across Languages
Authors: --- --- ---
Year: 2022
Publisher: Liège: Université de Liège (ULiège)

Abstract

Topic modeling is a learning task that analyzes texts to discover their topic composition by grouping correlated words. Historically, topic modeling has relied on unsupervised learning techniques. Bayesian generative models such as Latent Dirichlet Allocation (LDA) quickly proved effective at representing, as probability distributions, both the words within each topic and the topics within each document. More recently, new topic models derived from LDA have emerged, such as the Hierarchical Dirichlet Process (HDP), which infers the number of topics from the data, and the nested Hierarchical Dirichlet Process (nHDP), which provides a hierarchical representation of the topics.
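
To make the distinction concrete, the minimal sketch below fits an LDA model (fixed number of topics) and an HDP model (number of topics inferred from the data) with gensim on a toy corpus; the corpus, the two-topic setting and the pass count are illustrative placeholders rather than the thesis setup, and nHDP is omitted because it is not available in gensim.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, HdpModel

# Toy tokenized corpus (placeholder for preprocessed Wikipedia articles).
texts = [
    ["topic", "model", "word", "distribution", "document"],
    ["dirichlet", "process", "bayesian", "inference", "topic"],
    ["hierarchy", "tree", "topic", "subtopic", "document"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA: the number of topics must be fixed in advance.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# HDP: the number of topics is determined by the model itself.
hdp = HdpModel(corpus=corpus, id2word=dictionary)

print(lda.print_topics(num_words=3))
print(hdp.print_topics(num_topics=5, num_words=3))
```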

In this work, the topic-identification and hierarchical-modeling performance of HDP and nHDP was evaluated on English and French corpora built from Wikipedia articles. A large number of highly coherent and interesting topics were detected in both languages, despite the presence of some less coherent ones. Correlations were highlighted between corpus statistics and evaluation metrics such as topic coherence and model perplexity.
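
As a hedged illustration of how such metrics can be computed (not necessarily the evaluation code used in the thesis), the self-contained sketch below scores a toy LDA model with gensim's CoherenceModel and per-word perplexity bound; the C_v coherence variant and all data are assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized corpus (placeholder for preprocessed Wikipedia articles).
texts = [
    ["topic", "model", "word", "distribution", "document"],
    ["dirichlet", "process", "bayesian", "inference", "topic"],
    ["hierarchy", "tree", "topic", "subtopic", "document"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Topic coherence (C_v variant): how semantically related the top words
# of each topic are, measured against the reference texts.
coherence = CoherenceModel(
    model=lda, texts=texts, dictionary=dictionary, coherence="c_v"
).get_coherence()

# gensim returns a per-word likelihood bound; perplexity = 2 ** (-bound).
bound = lda.log_perplexity(corpus)

print(f"coherence (C_v): {coherence:.3f}")
print(f"perplexity: {2 ** -bound:.1f}")
```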

Additionally, a more recent approach that learns word embeddings in hyperbolic space, specifically in the Poincaré ball, was studied to determine whether it could constitute a promising basis for hierarchical topic modeling. Ten-dimensional Poincaré embeddings were trained on hypernymy relations extracted from our English corpus. Our analysis revealed clusters of words that can be linked to topics; unfortunately, the 2D representation method we applied did not reveal hierarchical relations between those clusters.
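
The sketch below shows, under assumptions, how such embeddings can be trained with gensim's PoincareModel; the hypernymy pairs, epoch count and negative-sample setting are invented placeholders, and only the embedding dimension of 10 matches the abstract.

```python
from gensim.models.poincare import PoincareModel

# Placeholder hypernymy pairs (hyponym, hypernym); the thesis extracts
# such relations from its English Wikipedia corpus.
relations = [
    ("poodle", "dog"), ("dog", "mammal"), ("cat", "mammal"),
    ("mammal", "animal"), ("sparrow", "bird"), ("bird", "animal"),
]

# 10-dimensional embeddings in the Poincaré ball, as in the abstract.
model = PoincareModel(relations, size=10, negative=2)
model.train(epochs=50)

# Hyperbolic distance between two embedded words: hierarchically distant
# terms should end up farther apart in the ball.
print(model.kv.distance("poodle", "animal"))
```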

In conclusion, both the HDP and nHDP models showed good and similar learning performance when trained on the French and English corpora, with nHDP also providing an effective hierarchical representation of the topics. The Poincaré embeddings successfully learned and represented the hypernymy relations in the Poincaré ball, but suffered from the constraints imposed by the data acquisition methods and the required filtering processes.
