This archive contains word embeddings extracted from Wikipedia using canonical correlation analysis with prior knowledge encoding from external resources such as FrameNet, WordNet and PPDB.
@article{osborne-16, author = "D. Osborne and S. Narayan and S. B. Cohen", title = "Encoding Prior Knowledge with Eigenword Embeddings", journal = "Transactions of the Association for Computational Linguistics", year = "2016" }
The files in this directory are:
The format for each file is
[word] [vector]
where vector is a space-separated list of 300 real numbers.
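As a sketch of how these files can be read, the following Python snippet parses the "[word] [vector]" format described above into a dictionary mapping each word to its vector. The function names and the use of a plain dict are illustrative, not part of the release.

```python
def parse_embedding_line(line):
    # One line is: word, then space-separated real numbers (300 per the format above).
    parts = line.rstrip("\n").split(" ")
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector

def load_embeddings(path):
    # Build a dict: word -> list of floats, one entry per line of the file.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip any blank lines
                word, vector = parse_embedding_line(line)
                embeddings[word] = vector
    return embeddings
```

For very large vocabularies (the files here cover 200k words), loading into NumPy arrays instead of Python lists would be more memory-efficient, but the plain-Python version above keeps the example dependency-free.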
The words used are the top 200k most frequent words from the first 5 gigabytes of Wikipedia. More details about alpha and the external sources of prior knowledge are in the paper.