Word embedding models have become essential tools in natural language processing, enabling machines to represent and manipulate human language with increasing sophistication. Among these models, Word2vec and fastText are prominent techniques for learning word representations from large text corpora. This article presents a comparative analysis of the two models, focusing on their effectiveness for synonym extraction in specialized domains such as radiation technology and medical imaging. By examining their performance across various synonym expression patterns and linguistic contexts, we aim to clarify the strengths and weaknesses of each model and to guide practitioners in selecting the most appropriate technique for their needs.
Overall Performance of fastText Compared to Word2vec
Our analysis reveals that fastText generally outperforms Word2vec across a range of evaluation metrics for synonym extraction. Notably, the fastText model employing the Continuous Bag-of-Words (CBOW) architecture achieved peak performance at 300 vector dimensions. Interestingly, fastText with CBOW demonstrated remarkable stability in performance across different vector dimensions, with minimal variation in scores. In contrast, within the Word2vec framework, models utilizing the skip-gram approach surpassed those based on CBOW, except when using 100-dimensional vector representations.
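As a concrete reference point, both model families can be trained and queried with the gensim library, which implements Word2vec and fastText with a shared interface. The sketch below is illustrative only: the corpus loader is hypothetical, and the hyperparameters (window, min_count, epochs) are common defaults rather than the study's exact settings.

```python
from gensim.models import Word2Vec, FastText

# `corpus` is assumed: a restartable iterable of tokenized sentences
# (lists of strings), e.g. produced by a Japanese morphological
# analyzer such as MeCab. `load_tokenized_corpus` is a hypothetical loader.
corpus = load_tokenized_corpus()

# Word2vec: sg=0 selects CBOW, sg=1 selects skip-gram.
w2v_cbow = Word2Vec(sentences=corpus, vector_size=300, sg=0,
                    window=5, min_count=5, epochs=5)

# fastText additionally learns character n-gram (subword) vectors;
# min_n/max_n control the n-gram lengths (gensim's defaults are 3 and 6).
ft_cbow = FastText(sentences=corpus, vector_size=300, sg=0,
                   window=5, min_count=5, epochs=5, min_n=3, max_n=6)

# Nearest neighbours in the vector space serve as synonym candidates.
print(ft_cbow.wv.most_similar("撮影", topn=10))  # "撮影" (imaging) is an illustrative query
```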
Figure 1: t-SNE visualization comparing Word2vec and fastText performance in synonym extraction. The left panel displays Word2vec CBOW with 800 dimensions, and the right panel shows fastText CBOW with 400 dimensions. The visualization highlights the clustering of words related to “irradiation,” “contrast,” and “-graphy” in Japanese shortened forms.
These findings indicate that for synonym extraction in the radiation technology domain, fastText with the CBOW architecture and vector dimensions between 300 and 400 is the most effective approach. However, true synonyms often appear below the top ranks, so practical automation of synonym extraction would require retrieving candidates from multiple ranks and applying a robust filtering step to refine the selection and ensure accuracy, as sketched below.
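A minimal sketch of such a retrieve-then-filter step, assuming a trained gensim model; the retrieval depth, similarity threshold, and stoplist are placeholder values to be tuned per domain.

```python
def synonym_candidates(model, query, depth=50, min_score=0.5, stoplist=frozenset()):
    """Retrieve a deep ranked neighbour list and filter it, since true
    synonyms often sit below the first few ranks."""
    ranked = model.wv.most_similar(query, topn=depth)
    return [(term, score) for term, score in ranked
            if score >= min_score          # drop weakly similar terms
            and term != query              # drop the query itself
            and term not in stoplist]      # drop known non-synonyms
```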
Analysis of Synonym Expression Patterns: When Does fastText Shine Compared to Word2vec?
In evaluating seven distinct synonym expression patterns, fastText demonstrated superior performance compared to Word2vec in four categories: “transliteration variants,” “different Japanese forms with the same pronunciation and meaning,” “Japanese shortened forms,” and “plural expressions.” A common characteristic of these four categories is that synonyms within a set differ by only a few characters; in other words, synonyms in these categories typically share a common n-gram, as illustrated below.
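This intuition can be made concrete: fastText represents a word as the sum of its character n-gram vectors, so synonyms that share n-grams are built from overlapping subword inventories. The sketch below mimics fastText's n-gram extraction, including its “<” and “>” word-boundary markers; the example word pair is illustrative and not taken from the study's synonym sets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams as fastText computes them, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)}

# Synonyms differing by only a few characters share many n-grams,
# so their fastText vectors are composed of overlapping parts.
shared = char_ngrams("angiography") & char_ngrams("angiogram")
print(sorted(shared))  # includes '<an', 'angi', 'ngio', ...
```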
Figure 1 illustrates the distribution of word vectors using t-distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique. Focusing on “Japanese shortened forms,” the visualization reveals that fastText tends to cluster terms with shared n-grams more closely together. Conversely, in Word2vec, clusters of words, even those sharing n-grams, tend to be more dispersed and overlapping. This observation further supports the notion that fastText holds an advantage in identifying synonyms with common n-grams at higher ranks.
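A projection like Figure 1 can be reproduced with scikit-learn's t-SNE implementation. In the sketch below, `model` is a trained gensim model and `words_of_interest` is a hypothetical list of domain terms to plot; the perplexity value is a placeholder and must stay below the number of plotted points.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Keep only terms that are in the model's vocabulary.
words = [w for w in words_of_interest if w in model.wv]
vectors = np.array([model.wv[w] for w in words])

# Reduce the 300-400 dimensional vectors to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=5, random_state=0,
              init="pca").fit_transform(vectors)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)
plt.title("t-SNE projection of synonym-set vectors")
plt.show()
```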
The optimal architecture (CBOW or skip-gram) and vector dimensionality varied across synonym set categories. Skip-gram was the better choice only for “different Japanese forms with the same pronunciation and meaning.” fastText with CBOW tends to favor words containing a common n-gram as synonym candidates, while skip-gram tends to extract synonym pairs lacking a common n-gram. CBOW’s advantage therefore grows as the proportion of synonym sets with shared n-grams increases; when shared n-grams are less prevalent, skip-gram may perform comparably or even better. Regarding dimensionality, increasing the vector size generally improved or maintained performance. While previous research suggests accuracy improves with higher dimensions, the optimal size can vary with the characteristics of the synonym sets, warranting further investigation; a sketch of such a sweep follows.
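One way to probe these interactions is a small grid over architecture and dimensionality, scoring each model by whether the gold synonym appears among the query's top-k neighbours. Here `corpus` and `gold_pairs` are assumed inputs (tokenized sentences and query/gold-synonym pairs), and the grid values are illustrative.

```python
from gensim.models import FastText

def top_k_accuracy(model, pairs, k=10):
    """Fraction of (query, gold-synonym) pairs whose gold term appears
    among the model's top-k nearest neighbours of the query."""
    hits = total = 0
    for query, gold in pairs:
        if query not in model.wv or gold not in model.wv:
            continue  # skip out-of-vocabulary pairs
        neighbours = {w for w, _ in model.wv.most_similar(query, topn=k)}
        hits += gold in neighbours
        total += 1
    return hits / total if total else 0.0

# `corpus` and `gold_pairs` as above (assumed inputs).
for dim in (100, 200, 300, 400):
    for sg, name in ((0, "CBOW"), (1, "skip-gram")):
        model = FastText(sentences=corpus, vector_size=dim, sg=sg,
                         min_count=5, epochs=5)
        print(f"fastText {name} dim={dim}: "
              f"{top_k_accuracy(model, gold_pairs):.3f}")
```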
Conversely, in categories like “conversion to transliteration,” “Japanese words and English acronyms,” and “Japanese and English words,” Word2vec exhibited comparable or even superior performance to fastText across four indices, depending on vector dimensions. However, the overall accuracy in these categories was lower, peaking at 50% for “Japanese words and English acronyms” and remaining below 30% for the others, falling short of the accuracy achieved in the initial four categories where fastText excelled. Synonym sets in these categories typically have few or no common character strings, posing a challenge for fastText. This highlights the need to explore methods for enhancing Word2vec’s accuracy in such scenarios.
Notably, when the desired outputs were English words or abbreviations, those terms tended to rank below Japanese words, likely because the models were trained primarily on a Japanese corpus and had limited exposure to English expressions. However, constraining the output to alphabet-only tokens significantly improved accuracy for both English words and abbreviations. This suggests that restricting the candidate character set is a useful strategy when the target output is in a language different from the training corpus.
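A simple way to apply this constraint, assuming a trained gensim model: retrieve a deep ranked list and keep only alphabetic tokens. The character pattern and the Japanese query term are illustrative.

```python
import re

# Accept only tokens of ASCII letters (optionally digits/hyphens),
# which captures English words and acronyms such as "MRI" or "CT".
ALPHABETIC = re.compile(r"[A-Za-z][A-Za-z0-9\-]*")

def english_candidates(model, query, depth=100, k=10):
    """Top-k alphabet-only candidates from a deep ranked neighbour list."""
    ranked = model.wv.most_similar(query, topn=depth)
    return [(term, score) for term, score in ranked
            if ALPHABETIC.fullmatch(term)][:k]

# e.g. english_candidates(model, "磁気共鳴")  # hypothetical query, "magnetic resonance"
```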
Domain-Specific Synonym Expression Patterns
In domains such as “Image Engineering,” “Physical Phenomena,” “Equipment,” “Radiation Therapy,” and “Medicine,” fastText with the CBOW model yielded optimal results; synonyms in these fields commonly share n-grams. Conversely, in “Radiation Control” and “Imaging Diagnosis,” fastText with skip-gram was preferred, perhaps because skip-gram is better at detecting words that pose difficulties for fastText CBOW, which would improve the evaluation indices in these areas. In “Informatics,” a significant portion of synonyms fell under “different Japanese writings of the same meaning” and “Japanese word and Japanese transliteration,” categories with a lower frequency of shared n-grams; where common n-grams are scarce, Word2vec may offer a relative advantage over fastText.
Comparison with Previous Studies: Contextualizing Our Findings
Comparing the skip-gram and CBOW architectures within Word2vec, previous studies have often reported CBOW performing better on tasks such as similar-word detection and text classification. In the fastText context, by contrast, existing research suggests skip-gram may outperform CBOW, particularly in sentiment-analysis-based classification. Our Word2vec results only partially align with these prior observations, since skip-gram surpassed CBOW except at 100 dimensions. Moreover, the performance difference between the most accurate model (CBOW) and skip-gram was marginal, approximately 1.9%; such a small gap suggests that skip-gram may hold an advantage depending on the specific task.
Regarding word embedding dimensionality, prior work has found that accuracy generally increases with vector size, with some studies reporting a performance peak around 300 dimensions. Our results are consistent with these general trends. However, it is important to note that previous studies did not specifically focus on medical terminology and differed in their task objectives from our synonym extraction research in specialized domains.
Conclusion
This comparative study highlights the nuances of using Word2vec and fastText for synonym extraction. While fastText generally performs better, particularly the CBOW architecture with vector dimensions of 300 to 400, the optimal choice depends on the synonym expression patterns and the domain: fastText excels when synonyms share common n-grams, while Word2vec may be more competitive when such shared substrings are rare. Further research is needed to refine these models and to explore hybrid approaches that combine the strengths of both, enabling even more accurate and robust synonym extraction in specialized domains such as radiation technology and medical imaging.