Library Docstrings

The Anacode Toolkit library consists of two modules: anacode.api and anacode.agg. anacode.api simplifies use of the API, while anacode.agg provides functionality for further analysis, aggregation and visualization of the results.
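
A minimal end-to-end sketch of how the two modules fit together (the token and texts are placeholders, and network access to the Anacode API is assumed):

    from anacode.api import client
    from anacode.agg import aggregation

    # Post two texts for concept and sentiment analysis.
    api = client.AnacodeClient('YOUR-TOKEN')
    result = api.analyze(['First text.', 'Second text.'],
                         ['concepts', 'sentiment'])

    # Wrap the JSON response for aggregation.
    dataset = aggregation.DatasetLoader.from_api_result(result)
    print(dataset.concepts.most_common_concepts(n=5))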

anacode.api

Writers

class anacode.api.writers.Writer

Base “abstract” class containing common methods needed by all implementations of the Writer interface.

The writer interface consists of init, close and write_bulk methods.
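
A minimal sketch of a subclass; which lower-level hooks must be overridden beyond init and close depends on the concrete storage backend, so treat this as illustrative only:

    from anacode.api import writers

    class LoggingWriter(writers.Writer):
        """Illustrative Writer that only announces its lifecycle."""

        def init(self):
            # Prepare storage here (open files, create dicts, ...).
            print('ready to store analysis results')

        def close(self):
            # Release storage here (close files, flush buffers, ...).
            print('finished storing analysis results')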

close()

Not implemented here! Each subclass should decide what to do here.

init()

Not implemented here! Each subclass should decide what to do here.

write_absa(analyzed, single_document=False)

Converts absa analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON absa analysis result
  • single_document (bool) – Whether the analysis describes a single document
write_analysis(analyzed)

Inspects analysis result for performed analysis and delegates persisting of results to appropriate write methods.

Parameters:analyzed (dict) – JSON analysis response
write_bulk(results)

Stores multiple Anacode API JSON responses, each marked with a call ID, as (call_id, call_result) tuples. Both scrape and analyze call IDs are defined in the anacode.codes module.

Parameters:results (list) – List of anacode responses with IDs of calls used
write_categories(analyzed, single_document=False)

Converts categories analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON categories analysis result
  • single_document (bool) – Whether the analysis describes a single document
write_concepts(analyzed, single_document=False)

Converts concepts analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON concepts analysis result
  • single_document (bool) – Whether the analysis describes a single document
write_row(call_type, call_result)

Inspects call_type and delegates call_result to the appropriate write method.

Parameters:
  • call_type (int) – Library’s ID of anacode call
  • call_result (list) – JSON response from Anacode API
write_sentiment(analyzed, single_document=False)

Converts sentiment analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON sentiment analysis result
  • single_document (bool) – Whether the analysis describes a single document
class anacode.api.writers.CSVWriter(target_dir='.')
__init__(target_dir='.')

Initializes Writer to store Anacode API analysis results in target_dir in csv files.

Parameters:target_dir (str) – Path to directory where to store csv files
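
Typical use, assuming analysis_json holds a JSON analysis response obtained from the API (the directory name is a placeholder):

    from anacode.api import writers

    csv_writer = writers.CSVWriter(target_dir='analysis-csvs')
    csv_writer.init()                         # prepare csv files
    csv_writer.write_analysis(analysis_json)  # persist one API response
    csv_writer.close()
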
class anacode.api.writers.DataFrameWriter(frames=None)

Writes Anacode API output into pandas.DataFrame instances.

__init__(frames=None)

Initializes dictionary of result frames. Alternatively uses given frames dict for storage.

Parameters:frames (dict) – Might be specified to use this instead of new dict
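
A short in-memory sketch; it assumes the frames dict is exposed as the writer's frames attribute and that analysis_json is a JSON analysis response obtained from the API:

    from anacode.api import writers

    df_writer = writers.DataFrameWriter()
    df_writer.init()
    df_writer.write_analysis(analysis_json)
    df_writer.close()

    # Results are now pandas.DataFrame instances keyed by dataset name.
    concepts_frame = df_writer.frames['concepts']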

Querying

class anacode.api.client.AnacodeClient(auth, base_url='https://api.anacode.de/')

Makes posting data to the server for analysis simpler by storing the user’s auth token, the URL of the Anacode API server and paths for analysis calls.

To find out more about specific API calls and analyses and their output format, please refer to https://api.anacode.de/api-docs/calls.html.

__init__(auth, base_url='https://api.anacode.de/')

Default value for base_url is taken from environment variable ANACODE_API_URL if set; otherwise, ‘https://api.anacode.de/‘ is used.

Parameters:
  • auth (str) – User’s token
  • base_url (str) – Anacode API server URL
analyze(texts, analyses, external_entity_data=None, single_document=False)

Use Anacode API to perform specified linguistic analysis on texts. Please consult https://api.anacode.de/api-docs/calls.html for more details and better understanding of parameters.

Parameters:
  • texts (list) – List of texts to analyze
  • analyses (list) – List of analyses to perform. Can contain ‘categories’, ‘concepts’, ‘sentiment’ and ‘absa’
  • external_entity_data – Provide additional entities to relate to sentiment evaluation.
  • single_document (bool) – Makes API treat texts as paragraphs of one document instead of treating them as separate documents
Returns:

dict – JSON analysis response
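
Example call (the token and texts are placeholders):

    from anacode.api import client

    api = client.AnacodeClient('YOUR-TOKEN')
    response = api.analyze(
        texts=['Paragraph one.', 'Paragraph two.'],
        analyses=['categories', 'absa'],
        single_document=True,  # treat the texts as one document
    )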

call(task)

Given a tuple of an Anacode API analysis code and arguments for that analysis, this calls the appropriate method out of scrape, categories, concepts, sentiment or absa and returns its result.

Parameters:task (tuple) – Task definition tuple - (analysis code, analysis args)
Returns:dict – JSON response of the delegated call
scrape(link)

Use Anacode API’s scrape call to scrape page from Web URL and return result.

Parameters:link (str) – URL that should be scraped
Returns:dict – JSON response of the scrape call
class anacode.api.client.Analyzer(client, writer, threads=1, bulk_size=100)

This class makes it simple to query with multiple threads and to store results in formats other than a list of JSON responses.

__init__(client, writer, threads=1, bulk_size=100)
Parameters:
  • client (anacode.api.client.AnacodeClient) – Will be used to post analysis to anacode api
  • writer (anacode.api.writers.Writer) – Needs to implement init, close and write_bulk methods from Writer interface
  • threads (int) – Number of concurrent threads to use, defaults to 1
  • bulk_size (int) – How often should writer’s write_bulk method be invoked, defaults to 100
analyze(texts, analyses, external_entity_data=None, single_document=False)

Signature-compatible counterpart of anacode.api.client.AnacodeClient.analyze() that queues the analysis task for bulk processing.

analyze_bulk()

Performs bulk analysis. Uses multiprocessing.dummy.Pool to post data to the Anacode API if the number of threads is greater than one.

Analysis results are not returned, but cached internally.

flush_analysis_data()

Writes all cached analysis results using writer.

scrape(link)

Signature-compatible counterpart of anacode.api.client.AnacodeClient.scrape() that queues the scrape task for bulk processing.

should_start_analysis()

Checks how many tasks are in queue and returns boolean indicating whether analysis should be performed.

Returns:bool – True if analysis should happen now, False otherwise
anacode.api.client.analyzer(auth, writer, threads=1, bulk_size=100, base_url='https://api.anacode.de/')

Convenience function for initializing a bulk Analyzer and, if needed, a temporary writer instance as well.

Parameters:
  • auth (str) – User’s token string
  • writer (anacode.api.writers.Writer, str or dict) – Writer instance that will store analysis results, path to a folder where csv files should be saved, or a dictionary where data frames should be stored
  • threads (int) – Number of threads to use for HTTPS communication with the server
  • bulk_size (int) – How often the writer’s write_bulk method should be invoked, defaults to 100
  • base_url (str) – Anacode API server URL
Returns:

anacode.api.client.Analyzer – Bulk analyzer instance
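
A sketch of bulk analysis via this function. It assumes analyzer can be used as a context manager that calls the writer’s init and close around the queued work; if your toolkit version does not support that, call analyze_bulk() and flush_analysis_data() on the returned Analyzer explicitly:

    from anacode.api import client

    texts = ['First text.', 'Second text.']  # placeholder corpus

    # Store results as csv files in 'out/' using two worker threads.
    with client.analyzer('YOUR-TOKEN', 'out/', threads=2) as api:
        for text in texts:
            api.analyze([text], ['concepts', 'sentiment'])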

anacode.agg

Dataset loader

class anacode.agg.aggregation.DatasetLoader(concepts=None, concepts_surface_strings=None, categories=None, sentiments=None, absa_entities=None, absa_normalized_texts=None, absa_relations=None, absa_relations_entities=None, absa_evaluations=None, absa_evaluations_entities=None)

Loads analysed data obtained via Anacode API from various formats.

__getitem__(item)

If item is the name of a linguistic dataset known to DatasetLoader, returns the corresponding dataset, or None if that data was not loaded. If item is not a recognized dataset name, raises KeyError.

Parameters:item (str) – possible values: categories, concepts, concepts_surface_strings, sentiments, absa_entities, absa_normalized_texts, absa_relations, absa_relations_entities, absa_evaluations, absa_evaluations_entities
Returns:pandas.DataFrame – DataFrame with requested data if found, else None
__init__(concepts=None, concepts_surface_strings=None, categories=None, sentiments=None, absa_entities=None, absa_normalized_texts=None, absa_relations=None, absa_relations_entities=None, absa_evaluations=None, absa_evaluations_entities=None)

Constructs a DatasetLoader instance that is aware of what data is available to it. Raises ValueError if no data is provided.

Data frames are expected to be in the format that anacode.api.writers.Writer produces.

Parameters:
  • concepts (pandas.DataFrame) – List of found concepts with metadata
  • concepts_surface_strings (pandas.DataFrame) – List of strings realizing concepts
  • categories (pandas.DataFrame) – List of document category probabilities
  • sentiments (pandas.DataFrame) – List of document sentiment polarities
  • absa_entities (pandas.DataFrame) – List of absa entities used in texts
  • absa_normalized_texts (pandas.DataFrame) – List of Chinese normalized strings identified and analyzed by absa
  • absa_relations (pandas.DataFrame) – List of absa relations with metadata
  • absa_relations_entities (pandas.DataFrame) – List of absa entities used in relations
  • absa_evaluations (pandas.DataFrame) – List of absa evaluations
  • absa_evaluations_entities (pandas.DataFrame) – List of absa entities used in evaluations
absa

Creates new ABSADataset if data is available.

Returns:anacode.agg.aggregations.ABSADataset
categories

Creates new CategoriesDataset if data is available.

Returns:anacode.agg.aggregations.CategoriesDataset
concepts

Creates new ConceptsDataset if data is available.

Returns:anacode.agg.aggregations.ConceptsDataset
filter(document_ids)

Creates new DatasetLoader instance using data only from documents with ids in document_ids.

Parameters:document_ids (iterable) – Iterable with document ids. Cannot be empty.
Returns:DatasetLoader – New DatasetLoader instance with data only from desired documents
classmethod from_api_result(result)

Initializes DatasetLoader from API JSON output. Works with both a single analysis result and a list of analysis results.

Parameters:result – Either single API JSON analysis dict or list of them
Returns:anacode.agg.DatasetLoader – DatasetLoader with available analysis data loaded
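
This accepts a list of responses as well, e.g. one response per queried text (api and texts are placeholders as in the earlier sketches):

    from anacode.agg import aggregation

    responses = [api.analyze([text], ['concepts']) for text in texts]
    dataset = aggregation.DatasetLoader.from_api_result(responses)
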
classmethod from_path(path, backup_suffix='')

Initializes DatasetLoader from Anacode API csv files present in the given path. You could have obtained these by using anacode.api.writers.CSVWriter to write your request results while querying the API.

Parameters:
  • path (str) – Path to folder where AnacodeAPI analysis is stored in csv files
  • backup_suffix (str) – If you want to load older dataset from file that has been backed up by toolkit, use this to specify suffix of file names
Returns:

anacode.agg.DatasetLoader – DatasetLoader with found csv files loaded into data frames
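
Loading results previously written by CSVWriter (the directory name is a placeholder):

    from anacode.agg import aggregation

    dataset = aggregation.DatasetLoader.from_path('analysis-csvs')
    concepts = dataset.concepts   # ConceptsDataset, if concept csv files were found
    print(dataset['categories'])  # raw DataFrame access; None if not loaded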

classmethod from_writer(writer)

Initializes DatasetLoader from writer instance that was used to store anacode analysis. Accepts both anacode.api.writers.DataFrameWriter and anacode.api.writers.CSVWriter.

Parameters:writer (anacode.api.writers.Writer) – Writer that was used by anacode.api.client.Analyzer to store analysis
Returns:anacode.agg.DatasetLoader – DatasetLoader with available data frames loaded
remove_concepts(concepts)

Remove given concepts from dataset if they are present.

Parameters:concepts (iterable) – These concepts will be removed from dataset
sentiments

Creates new SentimentDataset if data is available.

Returns:anacode.agg.aggregations.SentimentDataset

API Datasets

class anacode.agg.aggregation.ApiCallDataset

Base class for specific call data sets.

class anacode.agg.aggregation.CategoriesDataset(categories)

Categories dataset container with easy aggregation capabilities.

__init__(categories)

Initialize instance by providing categories data set.

Parameters:categories (pandas.DataFrame) – List of document category probabilities
categories()

Aggregates categories across the whole dataset.

Returns:pandas.Series – Aggregated category probabilities, with category names as index
main_category()

Finds the main category of a dataset.

Returns:str – Name of main category.
class anacode.agg.aggregation.ConceptsDataset(concepts, surface_strings)

Concept dataset container with easy aggregation capabilities.

__init__(concepts, surface_strings)

Initialize instance by providing the two dataframes required for concepts representation.

Parameters:
  • concepts (pandas.DataFrame) – List of found concepts with metadata
  • surface_strings (pandas.DataFrame) – List of strings realizing found concepts
co_occurring_concepts(concept, n=15, concept_type='')

Find n concepts co-occurring frequently in texts of this dataset with given concept, sorted by descending frequency. Co-occurring concepts can be filtered by their type.

Parameters:
  • concept (str) – Concept to inspect for co-occurring concepts
  • n (int) – Maximum number of returned co-occurring concepts
  • concept_type (str) – Limit co-occurring concept counts only to this type of concepts.
Returns:

pandas.Series – Co-occurring concept names as index and their frequencies sorted by descending frequency

concept_frequencies(max_concepts=200, concept_type='', concept_filter=None)

Returns pandas series with counts for all concepts from the dataset.

To filter which concepts are included in the result you can use concept_type and concept_filter. The former restricts the result to concepts of a specific type, and the latter is a callable that takes a concept name and returns a bool indicating whether the concept should pass the filter. You can set both at the same time; concept_type is applied first, concept_filter second.

Parameters:
  • max_concepts (int) – Maximum number of concepts to include in the result
  • concept_type (str) – Limit concepts only to those whose type starts with this string
  • concept_filter (callable) – If not None, the callable receives a concept name and returns True if the concept should pass the filter, False otherwise. Only concepts that pass are included in the result
Returns:

pandas.Series – Concept names as index and their counts as values
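
For example, counting only concepts of a hypothetical ‘brand’ type whose names are longer than two characters:

    from anacode.agg import aggregation

    # Load a ConceptsDataset from previously written csv files (placeholder path).
    concepts = aggregation.DatasetLoader.from_path('analysis-csvs').concepts
    freqs = concepts.concept_frequencies(
        max_concepts=50,
        concept_type='brand',                       # hypothetical type prefix
        concept_filter=lambda name: len(name) > 2,  # drop very short names
    )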

concept_frequency(concept, concept_type='', normalize=False)

Returns occurrence counts for the input concept or concepts. If the input is a list or tuple, the result preserves its order. Concepts that are not of concept_type or that are not in the dataset always have a zero count. Setting normalize turns absolute counts into relative percentages.

Specifying concept_type is intended to be used together with normalize. Without normalize it has no effect beyond zeroing the counts of concepts that are not of the given type; with normalize, percentages reflect counts within the specified concept_type only, instead of the whole dataset.

Parameters:
  • concept (list, tuple, set or str) – Name(s) of concept(s) to count occurrences for
  • concept_type (str) – Limit result concepts counts only to concepts with this type
  • normalize (bool) – Returns relative counts of concepts in specified concept type if set, otherwise returns absolute counts
Returns:

pandas.Series – Concept names as index and their counts as values sorted as they were in input.

least_common_concepts(n=15, concept_type='', normalize=False)

Counts concepts and returns the n least frequent ones sorted by their count ascending. Counted concepts can be filtered by their type using concept_type, and returned counts can be normalized with normalize.

If both concept_type and normalize are specified, concept ratios are computed only from concept counts within the given concept_type.

Parameters:
  • n (int) – Maximum number of concepts to return
  • concept_type (str) – Limit concept counts only to concepts whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Concept names as index and their counts as values sorted ascending

make_idf_filter(threshold, concept_type='')

Generates concept filter based on idf values of concepts in represented documents. This filter can be directly used as parameter for concept_cloud call.

Parameters:
  • threshold (float) – Minimum IDF of concept that will pass the filter
  • concept_type (str) – Limit co-occurring concept counts only to this type of concepts.
Returns:

callable – Function that can be used as idf_func in concept_cloud
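
A short sketch combining the idf filter with concept_frequencies; the threshold value is illustrative and concepts is a ConceptsDataset as above:

    idf_filter = concepts.make_idf_filter(threshold=2.0)
    freqs = concepts.concept_frequencies(max_concepts=200,
                                         concept_filter=idf_filter)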

make_time_series(concepts, date_info, delta, interval=None)

Creates a DataFrame with counts for each concept in every delta-sized time tick within interval. If you do not specify interval, it is computed from date_info so as to include all documents.

The concepts dataset carries no information about document release dates, so you have to provide it externally as date_info: a mapping whose keys are all document ids from the concepts dataset and whose values are datetime.date objects representing each document’s release date.

The result includes zero counts for ticks in which the concepts were not mentioned. Each row also carries start and stop times for that particular count; counts at the stop time are not included in the tick.

Parameters:
  • concepts (list) – List of concept names to make time series for
  • date_info (dict) – Keys need to be document ids in this dataset and values datetime.datetime or datetime.date objects
  • delta (datetime.timedelta) – Time series tick size
  • interval (tuple) – (start, stop) where both values are datetimes or dates
Returns:

pandas.DataFrame – DataFrame with columns “Concept”, “Count”, “Start” and “Stop”
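
A sketch with weekly ticks; date_info and the concept name are hypothetical, and concepts is a ConceptsDataset as above:

    from datetime import date, timedelta

    # Map every document id in the dataset to its release date.
    date_info = {0: date(2017, 1, 2), 1: date(2017, 1, 9)}
    series = concepts.make_time_series(
        concepts=['Engine'],         # hypothetical concept name
        date_info=date_info,
        delta=timedelta(days=7),     # one-week ticks
    )
    # series has columns "Concept", "Count", "Start" and "Stop"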

most_common_concepts(n=15, concept_type='', normalize=False)

Counts concepts and returns the n most frequent ones sorted by their count descending. Counted concepts can be filtered by their type using concept_type, and returned counts can be normalized with normalize.

If both concept_type and normalize are specified, concept ratios are computed only from concept counts within the given concept_type.

Parameters:
  • n (int) – Maximum number of most common concepts to return
  • concept_type (str) – Limit concept counts only to concepts whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Concept names as index and their counts as values sorted descending

nltk_textcollection(concept_type='')

Wraps the concepts of each represented document in nltk.text.Text and returns these wrapped in an nltk.text.TextCollection.

Parameters:concept_type (str) – Limit gathered concepts only to this type of concepts
Returns:nltk.text.TextCollection – TextCollection of represented documents
surface_forms(concept, n=15)

Finds at most n random surface strings from the analyzed texts that were identified as concept.

Parameters:
  • concept (str) – Inspect surface forms of this concept
  • n (int) – Maximum number of unique surface forms returned
Returns:

set – Set with maximum of n surface forms of concept

class anacode.agg.aggregation.SentimentDataset(sentiments)

Sentiment dataset container with easy aggregation capabilities.

__init__(sentiments)

Initialize instance by providing sentiments data set.

Parameters:sentiments (pandas.DataFrame) – List of document sentiment polarities
average_sentiment()

Computes and returns the average document sentiment. The result is a number in [-1, 1], where a higher number means more positive sentiment.

Returns:float – Average document sentiment
class anacode.agg.aggregation.ABSADataset(entities, normalized_texts, relations, relations_entities, evaluations, evaluations_entities)

ABSA dataset container with easy aggregation capabilities.

__init__(entities, normalized_texts, relations, relations_entities, evaluations, evaluations_entities)

Initialize instance by providing all absa data sets.

Parameters:
  • entities (pandas.DataFrame) – List of entities used in texts
  • normalized_texts (pandas.DataFrame) – List of Chinese normalized texts
  • relations (pandas.DataFrame) – List of relations with metadata
  • relations_entities (pandas.DataFrame) – List of entities used in relations
  • evaluations (pandas.DataFrame) – List of entity evaluations
  • evaluations_entities (pandas.DataFrame) – List of entities used in evaluations
best_rated_entities(n=15, entity_type='')

Find top n rated entities in this dataset sorted descending by their mean rating.

Parameters:
  • n (int) – Maximum count of returned entities
  • entity_type (str) – Optional filter for entity type to consider
Returns:

pandas.Series – Best rated entities in this dataset as index and their ratings as values sorted descending
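
For instance, comparing overall top-rated entities with the mean sentiment of specific ones (entity names are hypothetical):

    from anacode.agg import aggregation

    # Load an ABSADataset from previously written csv files (placeholder path).
    absa = aggregation.DatasetLoader.from_path('analysis-csvs').absa
    print(absa.best_rated_entities(n=10))
    print(absa.entity_sentiment(['Seats', 'Engine']))  # hypothetical entity names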

co_occurring_entities(entity, n=15, entity_type='')

Find n entities co-occurring frequently in texts of this dataset with given entity, sorted descending. Co-occurring entities can be filtered by their type.

Parameters:
  • entity (str) – Entity to inspect for co-occurring entities
  • n (int) – Maximum count of returned entities
  • entity_type (str) – Limit co-occurring entity counts only to this type of entities.
Returns:

pandas.Series – Co-occurring entity names as index and their counts as values sorted descending

entity_frequency(entity, entity_type='', normalize=False)

Returns occurrence counts for the input entity or entities. If the input is a list or tuple, the result preserves its order. Entities whose type does not start with the given entity_type or that are not in the dataset always have a zero count. Setting normalize turns absolute counts into relative percentages.

Specifying entity_type is intended to be used together with normalize. Without normalize it has no effect beyond the zeroing described above; with normalize, percentages reflect counts within the specified entity_type only, instead of the whole dataset.

Parameters:
  • entity (tuple, list, set or str) – Entity name or tuple/list/set of entity names
  • entity_type (str) – Optional filter for entity type to consider
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Entity names as index and entity frequencies as values, sorted as the input if it was a tuple or list

entity_sentiment(entity)

Computes and returns the mean rating for the given entity, or for multiple entities if a list, tuple or set is given. If the input is a list or tuple, the result Series preserves its order.

Parameters:entity (tuple, list, set or str) – Name(s) of entity(ies) to compute mean sentiment for
Returns:pandas.Series – Mean ratings for entities, np.nan if entity was not rated. Entity names are in index and their sentiments are values
entity_texts(entity)

Returns a dictionary mapping each given entity to the list of normalized texts in which it occurs.

Parameters:entity (tuple, list, set or str) – Name(s) of entities to find in normalized texts
Returns:dict – Map where keys are entity names and values are lists of normalized strings
least_common_entities(n=15, entity_type='', normalize=False)

Counts entities and returns the n least frequent ones sorted by their count ascending. Counted entities can be filtered by their type using entity_type, and returned counts can be normalized with normalize.

If both entity_type and normalize are specified, entity ratios are computed only from entity counts within the given entity_type.

Parameters:
  • n (int) – Maximum number of least frequent entities to return
  • entity_type (str) – Limit entities counts only to entities whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Entity names as index and their counts as values sorted ascending

most_common_entities(n=15, entity_type='', normalize=False)

Counts entities and returns the n most frequent ones sorted by their count descending. Counted entities can be filtered by their type using entity_type, and returned counts can be normalized with normalize.

If both entity_type and normalize are specified, entity ratios are computed only from entity counts within the given entity_type.

Parameters:
  • n (int) – Maximum number of most common entities to return
  • entity_type (str) – Limit entities counts only to entities whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Entity names as index and their counts as values sorted descending

surface_strings(entity)

Returns a dictionary mapping each given entity to the list of surface strings that realize it.

Parameters:entity (tuple, list, set or str) – Name(s) of entities to find surface strings for
Returns:dict – Map where keys are entity names and values are lists of surface strings
worst_rated_entities(n=15, entity_type='')

Find n worst rated entities in this dataset sorted ascending by their mean rating.

Parameters:
  • n (int) – Maximum count of returned entities
  • entity_type (str) – Optional filter for entity type to consider
Returns:

pandas.Series – Worst rated entities in this dataset as index and their ratings as values sorted ascending

anacode.agg.plotting

anacode.agg.plotting.barhchart(aggregation, path=None, color='dull green', title=None)

Plots an aggregation result as a horizontal bar chart.

Parameters:
  • aggregation (pd.Series) – anacode.agg aggregation result
  • path (str) – If specified, the graph is saved to this file instead of being returned
  • color (str) – seaborn named color for bars
  • title (str) – Title of chart; set to None for automatic title, empty string for no title
Returns:

matplotlib.axes._subplots.AxesSubplot – Axes for generated plot or None if graph was saved to file
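
Aggregation results plug directly into this helper; for example, saving the ten most common concepts as a bar chart (the file name is a placeholder, and concepts is a ConceptsDataset as in earlier sketches):

    from anacode.agg import plotting

    # concepts: ConceptsDataset, e.g. DatasetLoader.from_path('analysis-csvs').concepts
    top_concepts = concepts.most_common_concepts(n=10)
    plotting.barhchart(top_concepts, path='top_concepts.png')  # saves instead of returning axes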

anacode.agg.plotting.piechart(aggregation, path=None, colors=None, category_count=6, explode=0, edgesize=0, edgecolor='#333333', perc_color='black', labeldistance=1.1)

Plots piechart with categories.

Parameters:
  • aggregation (pd.Series) – Aggregation library result
  • path (str) – If specified, the graph is saved to this file instead of being returned
  • colors – Will be passed to matplotlib.pyplot.pie as colors
  • category_count (int) – How many categories to include in the pie chart
  • explode (float) – Size of whitespace between pie pieces
  • edgesize (int) – Pie’s edge size; set to 0 for no edge
  • edgecolor (matplotlib supported color) – Color of pie’s edge
  • perc_color (matplotlib supported color) – Color of percentages drawn inside the pie chart
  • labeldistance (float) – Radial distance at which the pie labels are drawn, passed to matplotlib.pyplot.pie
Returns:

matplotlib.axes._subplots.AxesSubplot – Axes for generated plot or None if graph was saved to file

anacode.agg.plotting.concept_cloud(aggregation, path=None, size=(600, 400), background='white', colormap_name='Accent', max_concepts=200, stopwords=None, font=None)

Generates a concept cloud image from aggregated concept frequencies and stores it to path. If path is None, returns the image as an np.ndarray instead. One way to view the resulting image is matplotlib’s imshow method.

Parameters:
  • aggregation (pandas.Series) – Concept frequency aggregation to plot; concept names as index and their counts as values
  • path (str) – Save plot to this file. Set to None if you want raw image np.ndarray of this plot as a return value
  • size (tuple) – Size of plot in pixels as tuple (width: int, height: int)
  • background (matplotlib color definition) – Name of background color
  • colormap_name (str) – Name of matplotlib colormap that will be used to sample random colors for concepts in plot
  • max_concepts (int) – Maximum number of concepts that will be plotted
  • stopwords (iter) – Optionally set stopwords to use for the plot
  • font (str) – Path to font that will be used
Returns:

np.ndarray – Raw image of the concept cloud if path is None, otherwise None (the image is saved to file)
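
A closing sketch that renders the cloud inline rather than saving it, using the np.ndarray return value with matplotlib (concepts is a ConceptsDataset as in earlier sketches):

    import matplotlib.pyplot as plt

    from anacode.agg import plotting

    # concepts: ConceptsDataset, e.g. DatasetLoader.from_path('analysis-csvs').concepts
    freqs = concepts.concept_frequencies(max_concepts=200)
    image = plotting.concept_cloud(freqs, path=None, size=(600, 400))
    plt.imshow(image)   # display the raw image array
    plt.axis('off')
    plt.show()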