edu.northwestern.at.morphadorner.corpuslinguistics.lexicon

Class AbstractLexicon

    • Field Detail

      • lexiconMap

        protected java.util.Map<java.lang.String,LexiconEntry> lexiconMap
        Map in which to store lexicon entries.

        An entry (e.g., word spelling) is the key, and a LexiconEntry is the value.

      • categoryCountsMap

        protected java.util.Map<java.lang.String,MutableInteger> categoryCountsMap
        Map from part of speech tags to their frequency in the lexicon.
      • uniqueEntryCountForCategoryMap

        protected java.util.Map<java.lang.String,MutableInteger> uniqueEntryCountForCategoryMap
        Map from part of speech tags to frequency of unique word entries in the lexicon with each tag.
      • longestEntryLength

        protected int longestEntryLength
        Length (in characters) of the longest entry in the lexicon.
      • shortestEntryLength

        protected int shortestEntryLength
        Length (in characters) of the shortest entry in the lexicon.
      • partOfSpeechTags

        protected PartOfSpeechTags partOfSpeechTags
        Part of Speech tag set used by lexicon.

        Note: all tags in the lexicon must appear in this list!

      • logger

        protected Logger logger
        Logger used for output.
    • Constructor Detail

      • AbstractLexicon

        public AbstractLexicon()
        Create an empty lexicon.
    • Method Detail

      • getLogger

        public Logger getLogger()
        Get the logger.
        Specified by:
        getLogger in interface UsesLogger
        Returns:
        The logger.
      • setLogger

        public void setLogger(Logger logger)
        Set the logger.
        Specified by:
        setLogger in interface UsesLogger
        Parameters:
        logger - The logger.
      • updateCategoryCount

        protected void updateCategoryCount(java.lang.String category,
                               int count)
        Add to or update the count for a category in the category counts map.
        Parameters:
        category - Category for which to add/update count.
        count - Category count to add to entry. May be negative.
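
        As an illustration of the add-or-update behavior described above, here is a minimal sketch. It is not the MorphAdorner implementation: the class name CategoryCountsSketch is hypothetical, plain Integer values stand in for MutableInteger, and returning 0 for an unseen category is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in only; not the MorphAdorner implementation.
final class CategoryCountsSketch {
    // Plain Integer values stand in for the library's MutableInteger.
    private final Map<String, Integer> categoryCounts = new HashMap<>();

    // Adds count (which may be negative) to the running total for category,
    // creating the map entry if the category has not been seen before.
    void updateCategoryCount(String category, int count) {
        categoryCounts.merge(category, count, Integer::sum);
    }

    // Returns the current total, or 0 for an unseen category (assumed convention).
    int getCategoryCount(String category) {
        return categoryCounts.getOrDefault(category, 0);
    }
}
```

        Calling updateCategoryCount("n1", 5) and then updateCategoryCount("n1", -2) on this sketch leaves a total of 3 for the arbitrary example tag "n1".
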
      • incrementUniqueEntryCountForCategory

        protected void incrementUniqueEntryCountForCategory(java.lang.String category)
        Increment number of unique entries for a category.
        Parameters:
        category - Category for which to increment count.
      • updateEntryCount

        public void updateEntryCount(java.lang.String entry,
                            java.lang.String category,
                            java.lang.String lemma,
                            int entryCount)
        Update entry count in lexicon for a given category.
        Specified by:
        updateEntryCount in interface Lexicon
        Parameters:
        entry - The entry.
        category - The category.
        lemma - The lemma.
        entryCount - The entry count to add to the current count. Must be positive.
      • removeEntryCategory

        public void removeEntryCategory(java.lang.String entry,
                               java.lang.String category)
        Remove given category for an entry.
        Specified by:
        removeEntryCategory in interface Lexicon
        Parameters:
        entry - The entry.
        category - The category to remove.

        If the entry has no remaining categories, the entry is removed from the lexicon.
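
        The removal rule above (drop the category, then drop the entry once no categories remain) can be sketched as follows. This is a simplified model under stated assumptions: the class RemoveCategorySketch and its add helper are hypothetical, and a nested map stands in for LexiconEntry.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in only; not the MorphAdorner implementation.
final class RemoveCategorySketch {
    // entry -> (category -> count); a simplified stand-in for LexiconEntry.
    private final Map<String, Map<String, Integer>> entries = new HashMap<>();

    // Hypothetical helper used to populate the sketch.
    void add(String entry, String category, int count) {
        entries.computeIfAbsent(entry, k -> new HashMap<>())
               .merge(category, count, Integer::sum);
    }

    // Removes one category; removes the whole entry if it has none left.
    void removeEntryCategory(String entry, String category) {
        Map<String, Integer> categories = entries.get(entry);
        if (categories == null) {
            return; // unknown entry: nothing to do
        }
        categories.remove(category);
        if (categories.isEmpty()) {
            entries.remove(entry);
        }
    }

    boolean containsEntry(String entry) {
        return entries.containsKey(entry);
    }
}
```
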

      • removeEntry

        public void removeEntry(java.lang.String entry)
        Remove entry.
        Specified by:
        removeEntry in interface Lexicon
        Parameters:
        entry - The entry to remove.
      • loadLexicon

        public void loadLexicon(java.net.URL lexiconURL,
                       java.lang.String encoding)
                         throws java.io.IOException
        Load entries into a lexicon.
        Specified by:
        loadLexicon in interface Lexicon
        Parameters:
        lexiconURL - URL for the file containing the lexicon.
        encoding - Character encoding of lexicon file text.
        Throws:
        java.io.IOException
      • loadLexicon

        public void loadLexicon(java.net.URL lexiconURL,
                       boolean compressed,
                       java.lang.String encoding)
                         throws java.io.IOException
        Load entries into a lexicon.
        Specified by:
        loadLexicon in interface Lexicon
        Parameters:
        lexiconURL - URL for the file containing the lexicon.
        compressed - true if lexicon is gzip compressed.
        encoding - Character encoding of lexicon file text.
        Throws:
        java.io.IOException
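
        To show how the compressed and encoding parameters might interact, the sketch below opens a URL, optionally wraps the stream in a gzip decoder, and decodes text with the given character encoding. It uses only JDK classes. The class LexiconReaderSketch and its readLines method are hypothetical, and because the lexicon file layout is not described here, the sketch returns raw lines without parsing them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Illustrative stand-in only; not the MorphAdorner implementation.
final class LexiconReaderSketch {
    // Reads all text lines from lexiconURL, decompressing first if requested.
    static List<String> readLines(URL lexiconURL, boolean compressed, String encoding)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream raw = lexiconURL.openStream();
             InputStream in = compressed ? new GZIPInputStream(raw) : raw;
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, encoding))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```
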
      • computeUniqueEntryCountsForCategories

        protected void computeUniqueEntryCountsForCategories()
        Compute number of lexicon entries for each category.
      • getLexiconSize

        public int getLexiconSize()
        Get number of entries in Lexicon.
        Specified by:
        getLexiconSize in interface Lexicon
        Returns:
        Number of entries in Lexicon.

        Returns number of fixed entries

      • getEntries

        public java.lang.String[] getEntries()
        Get the entries, sorted in ascending order.
        Specified by:
        getEntries in interface Lexicon
        Returns:
        The sorted entry strings as an array of strings.
      • getCategories

        public java.lang.String[] getCategories()
        Get the categories, sorted in ascending order.
        Specified by:
        getCategories in interface Lexicon
        Returns:
        The sorted category strings as an array of strings.
      • containsEntry

        public boolean containsEntry(java.lang.String entry)
        Checks if lexicon contains an entry.
        Specified by:
        containsEntry in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        true if lexicon contains entry. Only an exact match is considered.
      • getLexiconEntry

        public LexiconEntry getLexiconEntry(java.lang.String entry)
        Get a lexicon entry.
        Specified by:
        getLexiconEntry in interface Lexicon
        Parameters:
        entry - Entry for which to get lexicon information.
        Returns:
        LexiconEntry for entry, or null if not found.

        Note: this does NOT call the part of speech guesser.

      • setLexiconEntry

        public LexiconEntry setLexiconEntry(java.lang.String entry,
                                   LexiconEntry entryData)
        Set a lexicon entry.
        Specified by:
        setLexiconEntry in interface Lexicon
        Parameters:
        entry - Entry for which to set lexicon information.
        entryData - The lexicon entry data.
        Returns:
        Previous lexicon data for entry, if any.
      • getCategoriesForEntry

        public java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry)
        Get categories for an entry in the lexicon.
        Specified by:
        getCategoriesForEntry in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        Set of categories. Null if entry not found in lexicon.
      • getCategoriesForEntry

        public java.util.Set<java.lang.String> getCategoriesForEntry(java.util.List<java.lang.String> sentence,
                                                            int entryIndex)
        Get categories for an entry in a sentence.
        Specified by:
        getCategoriesForEntry in interface Lexicon
        Parameters:
        sentence - List of entries in sentence.
        entryIndex - Index within sentence (0-based) of entry.
        Returns:
        Set of categories. Null if entry not found in lexicon.
      • getCategoriesForEntry

        public java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry,
                                                            boolean isFirstEntry)
        Get categories for an entry.
        Specified by:
        getCategoriesForEntry in interface Lexicon
        Parameters:
        entry - Entry to look up.
        isFirstEntry - True if entry is first in sentence.
        Returns:
        Set of categories. Null if entry not found in lexicon.
      • getNumberOfCategoriesForEntry

        public int getNumberOfCategoriesForEntry(java.lang.String entry)
        Get number of categories for an entry.
        Specified by:
        getNumberOfCategoriesForEntry in interface Lexicon
        Parameters:
        entry - Entry for which to find number of categories.
        Returns:
        Number of categories for entry.
      • getLargestCategory

        public java.lang.String getLargestCategory(java.lang.String entry)
        Get category with largest count for an entry.
        Specified by:
        getLargestCategory in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        Category with largest count. Null if entry not found in lexicon.
      • getCategoryCount

        public int getCategoryCount(java.lang.String category)
        Get category count.
        Specified by:
        getCategoryCount in interface Lexicon
        Parameters:
        category - Category for which to get the number of times it appears in the lexicon.
        Returns:
        Category count.
      • getUniqueEntryCountForCategory

        public int getUniqueEntryCountForCategory(java.lang.String category)
        Get unique entry count for a category.
        Parameters:
        category - Category.
        Returns:
        Count of unique entries with this category.
      • getCategoryCount

        public int getCategoryCount(java.lang.String entry,
                           java.lang.String category)
        Get count for an entry in a specific category.
        Specified by:
        getCategoryCount in interface Lexicon
        Parameters:
        entry - Entry to look up.
        category - Category for which to retrieve count.
        Returns:
        Number of occurrences of entry in category.
      • getLemma

        public java.lang.String getLemma(java.lang.String entry)
        Get lemma for an entry.
        Specified by:
        getLemma in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        Lemma form of entry. A "*" is returned if the lemma cannot be found.

        Some spellings may have multiple lemmata depending upon the part of speech. This method returns the lemma associated with the most frequently occurring part of speech.
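
        The selection rule described above can be sketched as follows: find the category with the largest count for the entry, then return the lemma stored for that category, or "*" when nothing is found. This is a minimal model, not the library's code. The class LemmaLookupSketch, its add helper, and the two nested maps are hypothetical, and tie-breaking between equally frequent categories is left unspecified.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in only; not the MorphAdorner implementation.
final class LemmaLookupSketch {
    static final String UNKNOWN_LEMMA = "*";

    // entry -> (category -> count) and entry -> (category -> lemma).
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Map<String, String>> lemmata = new HashMap<>();

    // Hypothetical helper used to populate the sketch.
    void add(String entry, String category, String lemma, int count) {
        counts.computeIfAbsent(entry, k -> new HashMap<>())
              .merge(category, count, Integer::sum);
        lemmata.computeIfAbsent(entry, k -> new HashMap<>())
               .put(category, lemma);
    }

    // Returns the category with the largest count, or null for an unknown entry.
    String getLargestCategory(String entry) {
        Map<String, Integer> categoryCounts = counts.get(entry);
        if (categoryCounts == null) {
            return null;
        }
        String best = null;
        int bestCount = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : categoryCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Returns the lemma for the most frequent category, or "*" if not found.
    String getLemma(String entry) {
        String category = getLargestCategory(entry);
        if (category == null) {
            return UNKNOWN_LEMMA;
        }
        return lemmata.get(entry).getOrDefault(category, UNKNOWN_LEMMA);
    }
}
```
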

      • getLemmata

        public java.lang.String[] getLemmata(java.lang.String entry)
        Get all lemmata for an entry.
        Specified by:
        getLemmata in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        Lemmata forms of entry.
      • getLemma

        public java.lang.String getLemma(java.lang.String entry,
                                java.lang.String category)
        Get lemma for an entry in a specific category.
        Specified by:
        getLemma in interface Lexicon
        Parameters:
        entry - Entry to look up.
        category - Category for which to retrieve lemma.
        Returns:
        Lemma form of entry. A "*" is returned if the lemma cannot be found.
      • getCategoryCounts

        public java.util.Map<java.lang.String,MutableInteger> getCategoryCounts()
        Get category counts.
        Specified by:
        getCategoryCounts in interface Lexicon
        Returns:
        Category counts map.
      • getNumberOfCategories

        public int getNumberOfCategories()
        Get number of categories.
        Specified by:
        getNumberOfCategories in interface Lexicon
        Returns:
        Number of categories.
      • getCategoryCountsForEntry

        public java.util.Map<java.lang.String,MutableInteger> getCategoryCountsForEntry(java.lang.String entry)
        Get category counts for an entry.
        Specified by:
        getCategoryCountsForEntry in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        Map of counts for each category. String keys are tags, MutableInteger counts are values. Null if entry not found in lexicon.
      • getEntryCount

        public int getEntryCount(java.lang.String entry)
        Get total count for an entry.
        Specified by:
        getEntryCount in interface Lexicon
        Parameters:
        entry - Entry to look up.
        Returns:
        Count of occurrences of entry.
      • saveLexiconToTextFile

        public void saveLexiconToTextFile(java.lang.String lexiconFileName,
                                 java.lang.String encoding)
                                   throws java.io.IOException
        Save lexicon to a file.
        Specified by:
        saveLexiconToTextFile in interface Lexicon
        Parameters:
        lexiconFileName - Name of the file to which the lexicon is saved.
        encoding - Character encoding of lexicon file text.
        Throws:
        java.io.IOException
      • getLongestEntryLength

        public int getLongestEntryLength()
        Get the longest entry length in the lexicon.
        Specified by:
        getLongestEntryLength in interface Lexicon
        Returns:
        The longest entry length in the lexicon.
      • getShortestEntryLength

        public int getShortestEntryLength()
        Get the shortest entry length in the lexicon.
        Specified by:
        getShortestEntryLength in interface Lexicon
        Returns:
        The shortest entry length in the lexicon.
      • checkCategoriesList

        protected boolean checkCategoriesList()
        Check that all the tags in the lexicon appear in the designated part of speech tags list.
        Returns:
        true if all tags used in lexicon appear in designated part of speech list.
      • getPartOfSpeechTags

        public PartOfSpeechTags getPartOfSpeechTags()
        Get the part of speech tags list used by the lexicon.
        Specified by:
        getPartOfSpeechTags in interface Lexicon
        Returns:
        Part of speech tags list.
      • setPartOfSpeechTags

        public boolean setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags)
        Set the part of speech tags list used by the lexicon.
        Specified by:
        setPartOfSpeechTags in interface Lexicon
        Parameters:
        partOfSpeechTags - Part of speech tags list.
        Returns:
        true if all categories in lexicon appear in the part of speech tags list.

        For the check to work, the part of speech tags list should be set after the lexicon is loaded.
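
        The ordering requirement above can be illustrated with a small sketch of the check itself. It assumes the check is a plain set-containment test. The class TagCheckSketch and its methods are hypothetical, and Set<String> stands in for PartOfSpeechTags.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in only; not the MorphAdorner implementation.
final class TagCheckSketch {
    // Returns the lexicon categories that are absent from the tag set.
    static Set<String> missingTags(Set<String> lexiconCategories, Set<String> tagSet) {
        Set<String> missing = new HashSet<>(lexiconCategories);
        missing.removeAll(tagSet);
        return missing;
    }

    // True if every category used in the lexicon appears in the tag set.
    static boolean checkCategoriesList(Set<String> lexiconCategories, Set<String> tagSet) {
        return missingTags(lexiconCategories, tagSet).isEmpty();
    }
}
```

        In this model an empty lexicon passes the check trivially, which suggests why the tag list should be set only after the lexicon has been loaded.
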