Library Docstrings

The Anacode Toolkit library consists of two modules: anacode.api and anacode.agg. anacode.api simplifies use of the API, while anacode.agg provides functionality for further analysis, aggregation and visualization of the results.
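
A minimal end-to-end sketch of how the two modules fit together (the token and texts are placeholders, and network access to the Anacode API is assumed):

    from anacode.api import client
    from anacode.agg import aggregation

    # Post two texts for concept and sentiment analysis.
    api = client.AnacodeClient('YOUR-TOKEN')
    result = api.analyze(['First text.', 'Second text.'],
                         ['concepts', 'sentiment'])

    # Wrap the JSON response for aggregation.
    dataset = aggregation.DatasetLoader.from_api_result(result)
    print(dataset.concepts.most_common_concepts(n=5))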

anacode.api

Writers

class anacode.api.writers.Writer

Base “abstract” class containing common methods needed by all implementations of the Writer interface.

The writer interface consists of init, close and write_bulk methods.
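
A minimal sketch of a subclass; which lower-level hooks must be overridden beyond init and close depends on the concrete storage backend, so treat this as illustrative only:

    from anacode.api import writers

    class LoggingWriter(writers.Writer):
        """Illustrative Writer that only announces its lifecycle."""

        def init(self):
            # Prepare storage here (open files, create dicts, ...).
            print('ready to store analysis results')

        def close(self):
            # Release storage here (close files, flush buffers, ...).
            print('finished storing analysis results')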

close()

Not implemented here! Each subclass should decide what to do here.

init()

Not implemented here! Each subclass should decide what to do here.

write_absa(analyzed, single_document=False)

Converts absa analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON absa analysis result
  • single_document (bool) – Whether the analysis describes a single document
write_analysis(analyzed)

Inspects analysis result for performed analysis and delegates persisting of results to appropriate write methods.

Parameters:analyzed (dict) – JSON analysis response
write_bulk(results)

Stores multiple Anacode API JSON responses, each marked with a call ID, as (call_id, call_result) tuples. Both scrape and analyze call IDs are defined in the anacode.codes module.

Parameters:results (list) – List of anacode responses with IDs of calls used
write_categories(analyzed, single_document=False)

Converts categories analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON categories analysis result
  • single_document (bool) – Whether the analysis describes a single document
write_concepts(analyzed, single_document=False)

Converts concepts analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON concepts analysis result
  • single_document (bool) – Whether the analysis describes a single document
write_row(call_type, call_result)

Inspects call_type and delegates call_result to the appropriate write method.

Parameters:
  • call_type (int) – Library’s ID of anacode call
  • call_result (list) – JSON response from Anacode API
write_sentiment(analyzed, single_document=False)

Converts sentiment analysis result to flat lists and stores them.

Parameters:
  • analyzed (list) – JSON sentiment analysis result
  • single_document (bool) – Whether the analysis describes a single document
class anacode.api.writers.CSVWriter(target_dir='.')
__init__(target_dir='.')

Initializes Writer to store Anacode API analysis results in target_dir in csv files.

Parameters:target_dir (str) – Path to directory where to store csv files
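
Typical use, assuming analysis_json holds a JSON analysis response obtained from the API (the directory name is a placeholder):

    from anacode.api import writers

    csv_writer = writers.CSVWriter(target_dir='analysis-csvs')
    csv_writer.init()                         # prepare csv files
    csv_writer.write_analysis(analysis_json)  # persist one API response
    csv_writer.close()
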
class anacode.api.writers.DataFrameWriter(frames=None)

Writes Anacode API output into pandas.DataFrame instances.

__init__(frames=None)

Initializes dictionary of result frames. Alternatively uses given frames dict for storage.

Parameters:frames (dict) – Might be specified to use this instead of new dict
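
A short in-memory sketch; it assumes the frames dict is exposed as the writer's frames attribute and that analysis_json is a JSON analysis response obtained from the API:

    from anacode.api import writers

    df_writer = writers.DataFrameWriter()
    df_writer.init()
    df_writer.write_analysis(analysis_json)
    df_writer.close()

    # Results are now pandas.DataFrame instances keyed by dataset name.
    concepts_frame = df_writer.frames['concepts']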

Querying

class anacode.api.client.AnacodeClient(auth, base_url='https://api.anacode.de/')

Makes posting data to the server for analysis simpler by storing the user’s auth token, the URL of the Anacode API server and paths for analysis calls.

To find out more about specific API calls and analyses and their output format, please refer to https://api.anacode.de/api-docs/calls.html.

__init__(auth, base_url='https://api.anacode.de/')

Default value for base_url is taken from environment variable ANACODE_API_URL if set; otherwise, ‘https://api.anacode.de/‘ is used.

Parameters:
  • auth (str) – User’s token
  • base_url (str) – Anacode API server URL
analyze(texts, analyses, external_entity_data=None, single_document=False)

Use Anacode API to perform specified linguistic analysis on texts. Please consult https://api.anacode.de/api-docs/calls.html for more details and better understanding of parameters.

Parameters:
  • texts (list) – List of texts to analyze
  • analyses (list) – List of analyses to perform. Can contain ‘categories’, ‘concepts’, ‘sentiment’ and ‘absa’
  • external_entity_data – Provide additional entities to relate to sentiment evaluation.
  • single_document (bool) – Makes API treat texts as paragraphs of one document instead of treating them as separate documents
Returns:

dict – JSON analysis response
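
Example call (the token and texts are placeholders):

    from anacode.api import client

    api = client.AnacodeClient('YOUR-TOKEN')
    response = api.analyze(
        texts=['Paragraph one.', 'Paragraph two.'],
        analyses=['categories', 'absa'],
        single_document=True,  # treat the texts as one document
    )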

call(task)

Given a tuple of an Anacode API analysis code and arguments for that analysis, this calls the appropriate method out of scrape, categories, concepts, sentiment or absa and returns its result.

Parameters:task (tuple) – Task definition tuple - (analysis code, analysis args)
Returns:dict – JSON response of the delegated call
scrape(link)

Use Anacode API’s scrape call to scrape page from Web URL and return result.

Parameters:link (str) – URL that should be scraped
Returns:dict – JSON response of the scrape call
class anacode.api.client.Analyzer(client, writer, threads=1, bulk_size=100)

This class makes it simple to query with multiple threads and to store results in formats other than a list of JSON responses.

__init__(client, writer, threads=1, bulk_size=100)
Parameters:
  • client (anacode.api.client.AnacodeClient) – Will be used to post analysis to anacode api
  • writer (anacode.api.writers.Writer) – Needs to implement init, close and write_bulk methods from Writer interface
  • threads (int) – Number of concurrent threads to use, defaults to 1
  • bulk_size (int) – How often should writer’s write_bulk method be invoked, defaults to 100
analyze(texts, analyses, external_entity_data=None, single_document=False)

Signature-compatible counterpart of anacode.api.client.AnacodeClient.analyze() that queues the analysis task for bulk processing.

analyze_bulk()

Performs bulk analysis. Uses multiprocessing.dummy.Pool to post data to the Anacode API if the number of threads is greater than one.

Analysis results are not returned, but cached internally.

flush_analysis_data()

Writes all cached analysis results using writer.

scrape(link)

Signature-compatible counterpart of anacode.api.client.AnacodeClient.scrape() that queues the scrape task for bulk processing.

should_start_analysis()

Checks how many tasks are in queue and returns boolean indicating whether analysis should be performed.

Returns:bool – True if analysis should happen now, False otherwise
anacode.api.client.analyzer(auth, writer, threads=1, bulk_size=100, base_url='https://api.anacode.de/')

Convenience function for initializing a bulk Analyzer and, if needed, a temporary writer instance as well.

Parameters:
  • auth (str) – User’s token string
  • writer (anacode.api.writers.Writer, str or dict) – Writer instance that will store analysis results, path to a folder where csv files should be saved, or a dictionary where data frames should be stored
  • threads (int) – Number of threads to use for HTTPS communication with the server
  • bulk_size (int) – How often the writer’s write_bulk method should be invoked, defaults to 100
  • base_url (str) – Anacode API server URL
Returns:

anacode.api.client.Analyzer – Bulk analyzer instance
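
A sketch of bulk analysis via this function. It assumes analyzer can be used as a context manager that calls the writer’s init and close around the queued work; if your toolkit version does not support that, call analyze_bulk() and flush_analysis_data() on the returned Analyzer explicitly:

    from anacode.api import client

    texts = ['First text.', 'Second text.']  # placeholder corpus

    # Store results as csv files in 'out/' using two worker threads.
    with client.analyzer('YOUR-TOKEN', 'out/', threads=2) as api:
        for text in texts:
            api.analyze([text], ['concepts', 'sentiment'])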

anacode.agg

Dataset loader

class anacode.agg.aggregation.DatasetLoader(concepts=None, concepts_surface_strings=None, categories=None, sentiments=None, absa_entities=None, absa_normalized_texts=None, absa_relations=None, absa_relations_entities=None, absa_evaluations=None, absa_evaluations_entities=None)

Loads analysed data obtained via Anacode API from various formats.

__getitem__(item)

If item is the name of a linguistic dataset known to DatasetLoader, returns the corresponding dataset, or None if that data was not loaded. If item is not a recognized dataset name, raises KeyError.

Parameters:item (str) – possible values: categories, concepts, concepts_surface_strings, sentiments, absa_entities, absa_normalized_texts, absa_relations, absa_relations_entities, absa_evaluations, absa_evaluations_entities
Returns:pandas.DataFrame – DataFrame with requested data if found, else None
__init__(concepts=None, concepts_surface_strings=None, categories=None, sentiments=None, absa_entities=None, absa_normalized_texts=None, absa_relations=None, absa_relations_entities=None, absa_evaluations=None, absa_evaluations_entities=None)

Constructs a DatasetLoader instance that is aware of what data is available to it. Raises ValueError if no data is provided.

Data frames are expected to be in the format that anacode.api.writers.Writer produces.

Parameters:
  • concepts (pandas.DataFrame) – List of found concepts with metadata
  • concepts_surface_strings (pandas.DataFrame) – List of strings realizing concepts
  • categories (pandas.DataFrame) – List of document category probabilities
  • sentiments (pandas.DataFrame) – List of document sentiment polarities
  • absa_entities (pandas.DataFrame) – List of absa entities used in texts
  • absa_normalized_texts (pandas.DataFrame) – List of Chinese normalized strings identified and analyzed by absa
  • absa_relations (pandas.DataFrame) – List of absa relations with metadata
  • absa_relations_entities (pandas.DataFrame) – List of absa entities used in relations
  • absa_evaluations (pandas.DataFrame) – List of absa evaluations
  • absa_evaluations_entities (pandas.DataFrame) – List of absa entities used in evaluations
absa

Creates new ABSADataset if data is available.

Returns:anacode.agg.aggregations.ABSADataset
categories

Creates new CategoriesDataset if data is available.

Returns:anacode.agg.aggregations.CategoriesDataset
concepts

Creates new ConceptsDataset if data is available.

Returns:anacode.agg.aggregations.ConceptsDataset
filter(document_ids)

Creates new DatasetLoader instance using data only from documents with ids in document_ids.

Parameters:document_ids (iterable) – Iterable with document ids. Cannot be empty.
Returns:DatasetLoader – New DatasetLoader instance with data only from desired documents
classmethod from_api_result(result)

Initializes DatasetLoader from API JSON output. Works with both a single analysis result and a list of analysis results.

Parameters:result – Either single API JSON analysis dict or list of them
Returns:anacode.agg.DatasetLoader – DatasetLoader with available analysis data loaded
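
This accepts a list of responses as well, e.g. one response per queried text (api and texts are placeholders as in the earlier sketches):

    from anacode.agg import aggregation

    responses = [api.analyze([text], ['concepts']) for text in texts]
    dataset = aggregation.DatasetLoader.from_api_result(responses)
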
classmethod from_path(path, backup_suffix='')

Initializes DatasetLoader from Anacode API csv files present in the given path. You could have obtained these by using anacode.api.writers.CSVWriter to write your request results while querying the API.

Parameters:
  • path (str) – Path to folder where AnacodeAPI analysis is stored in csv files
  • backup_suffix (str) – If you want to load older dataset from file that has been backed up by toolkit, use this to specify suffix of file names
Returns:

anacode.agg.DatasetLoader – DatasetLoader with found csv files loaded into data frames
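
Loading results previously written by CSVWriter (the directory name is a placeholder):

    from anacode.agg import aggregation

    dataset = aggregation.DatasetLoader.from_path('analysis-csvs')
    concepts = dataset.concepts   # ConceptsDataset, if concept csv files were found
    print(dataset['categories'])  # raw DataFrame access; None if not loaded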

classmethod from_writer(writer)

Initializes DatasetLoader from writer instance that was used to store anacode analysis. Accepts both anacode.api.writers.DataFrameWriter and anacode.api.writers.CSVWriter.

Parameters:writer (anacode.api.writers.Writer) – Writer that was used by anacode.api.client.Analyzer to store analysis
Returns:anacode.agg.DatasetLoader – DatasetLoader with available data frames loaded
remove_concepts(concepts)

Remove given concepts from dataset if they are present.

Parameters:concepts (iterable) – These concepts will be removed from dataset
sentiments

Creates new SentimentDataset if data is available.

Returns:anacode.agg.aggregations.SentimentDataset

API Datasets

class anacode.agg.aggregation.ApiCallDataset

Base class for specific call data sets.

class anacode.agg.aggregation.CategoriesDataset(categories)

Categories dataset container with easy aggregation capabilities.

__init__(categories)

Initialize instance by providing categories data set.

Parameters:categories (pandas.DataFrame) – List of document category probabilities
categories()

Aggregates categories across the whole dataset.

Returns:pandas.Series – Aggregated category probabilities, with category names as index
main_category()

Finds the main category of a dataset.

Returns:str – Name of main category.
class anacode.agg.aggregation.ConceptsDataset(concepts, surface_strings)

Concept dataset container with easy aggregation capabilities.

__init__(concepts, surface_strings)

Initialize instance by providing the two dataframes required for concepts representation.

Parameters:
  • concepts (pandas.DataFrame) – List of found concepts with metadata
  • surface_strings (pandas.DataFrame) – List of strings realizing found concepts
co_occurring_concepts(concept, n=15, concept_type='')

Find n concepts co-occurring frequently in texts of this dataset with given concept, sorted by descending frequency. Co-occurring concepts can be filtered by their type.

Parameters:
  • concept (str) – Concept to inspect for co-occurring concepts
  • n (int) – Maximum number of returned co-occurring concepts
  • concept_type (str) – Limit co-occurring concept counts only to this type of concepts.
Returns:

pandas.Series – Co-occurring concept names as index and their frequencies sorted by descending frequency

concept_frequencies(max_concepts=200, concept_type='', concept_filter=None)

Returns pandas series with counts for all concepts from the dataset.

To filter which concepts are included in the result you can use concept_type and concept_filter. The former restricts the result to concepts of a specific type, and the latter is a callable that takes a concept name and returns a bool indicating whether the concept should pass the filter. You can set both at the same time; concept_type is applied first, concept_filter second.

Parameters:
  • max_concepts (int) – Maximum number of concepts to include in the result
  • concept_type (str) – Limit concepts only to those whose type starts with this string
  • concept_filter (callable) – If not None, the callable receives a concept name and returns True if the concept should pass the filter, False otherwise. Only concepts that pass are included in the result
Returns:

pandas.Series – Concept names as index and their counts as values
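
For example, counting only concepts of a hypothetical ‘brand’ type whose names are longer than two characters:

    from anacode.agg import aggregation

    # Load a ConceptsDataset from previously written csv files (placeholder path).
    concepts = aggregation.DatasetLoader.from_path('analysis-csvs').concepts
    freqs = concepts.concept_frequencies(
        max_concepts=50,
        concept_type='brand',                       # hypothetical type prefix
        concept_filter=lambda name: len(name) > 2,  # drop very short names
    )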

concept_frequency(concept, concept_type='', normalize=False)

Returns occurrence counts for the input concept or concepts. If the input is a list or tuple, the result preserves its order. Concepts that are not of concept_type or that are not in the dataset always have a zero count. Setting normalize turns absolute counts into relative percentages.

Specifying concept_type is intended to be used together with normalize. Without normalize it has no effect beyond zeroing the counts of concepts that are not of the given type; with normalize, percentages reflect counts within the specified concept_type only, instead of the whole dataset.

Parameters:
  • concept (list, tuple, set or str) – Name(s) of concept(s) to count occurrences for
  • concept_type (str) – Limit result concepts counts only to concepts with this type
  • normalize (bool) – Returns relative counts of concepts in specified concept type if set, otherwise returns absolute counts
Returns:

pandas.Series – Concept names as index and their counts as values sorted as they were in input.

least_common_concepts(n=15, concept_type='', normalize=False)

Counts concepts and returns the n least frequent ones sorted by their count ascending. Counted concepts can be filtered by their type using concept_type, and returned counts can be normalized with normalize.

If both concept_type and normalize are specified, concept ratios are computed only from concept counts within the given concept_type.

Parameters:
  • n (int) – Maximum number of concepts to return
  • concept_type (str) – Limit concept counts only to concepts whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Concept names as index and their counts as values sorted ascending

make_idf_filter(threshold, concept_type='')

Generates concept filter based on idf values of concepts in represented documents. This filter can be directly used as parameter for concept_cloud call.

Parameters:
  • threshold (float) – Minimum IDF of concept that will pass the filter
  • concept_type (str) – Limit co-occurring concept counts only to this type of concepts.
Returns:

callable – Function that can be used as idf_func in concept_cloud
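
A short sketch combining the idf filter with concept_frequencies; the threshold value is illustrative and concepts is a ConceptsDataset as above:

    idf_filter = concepts.make_idf_filter(threshold=2.0)
    freqs = concepts.concept_frequencies(max_concepts=200,
                                         concept_filter=idf_filter)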

make_time_series(concepts, date_info, delta, interval=None)

Creates a DataFrame with counts for each concept in every delta-sized time tick within interval. If you do not specify interval, it is computed from date_info so as to include all documents.

The concepts dataset carries no information about document release dates, so you have to provide it externally as date_info: a mapping whose keys are all document ids from the concepts dataset and whose values are datetime.date objects representing each document’s release date.

The result includes zero counts for ticks in which the concepts were not mentioned. Each row also carries start and stop times for that particular count; counts at the stop time are not included in the tick.

Parameters:
  • concepts (list) – List of concept names to make time series for
  • date_info (dict) – Keys need to be document ids in this dataset and values datetime.datetime or datetime.date objects
  • delta (datetime.timedelta) – Time series tick size
  • interval (tuple) – (start, stop) where both values are datetimes or dates
Returns:

pandas.DataFrame – DataFrame with columns “Concept”, “Count”, “Start” and “Stop”
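
A sketch with weekly ticks; date_info and the concept name are hypothetical, and concepts is a ConceptsDataset as above:

    from datetime import date, timedelta

    # Map every document id in the dataset to its release date.
    date_info = {0: date(2017, 1, 2), 1: date(2017, 1, 9)}
    series = concepts.make_time_series(
        concepts=['Engine'],         # hypothetical concept name
        date_info=date_info,
        delta=timedelta(days=7),     # one-week ticks
    )
    # series has columns "Concept", "Count", "Start" and "Stop"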

most_common_concepts(n=15, concept_type='', normalize=False)

Counts concepts and returns the n most frequent ones sorted by their count descending. Counted concepts can be filtered by their type using concept_type, and returned counts can be normalized with normalize.

If both concept_type and normalize are specified, concept ratios are computed only from concept counts within the given concept_type.

Parameters:
  • n (int) – Maximum number of most common concepts to return
  • concept_type (str) – Limit concept counts only to concepts whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Concept names as index and their counts as values sorted descending

nltk_textcollection(concept_type='')

Wraps the concepts of each represented document in nltk.text.Text and returns these wrapped in an nltk.text.TextCollection.

Parameters:concept_type (str) – Limit gathered concepts only to this type of concepts
Returns:nltk.text.TextCollection – TextCollection of represented documents
surface_forms(concept, n=15)

Finds at most n random surface strings from the analyzed texts that were identified as concept.

Parameters:
  • concept (str) – Inspect surface forms of this concept
  • n (int) – Maximum number of unique surface forms returned
Returns:

set – Set with maximum of n surface forms of concept

class anacode.agg.aggregation.SentimentDataset(sentiments)

Sentiment dataset container with easy aggregation capabilities.

__init__(sentiments)

Initialize instance by providing sentiments data set.

Parameters:sentiments (pandas.DataFrame) – List of document sentiment polarities
average_sentiment()

Computes and returns the average document sentiment. The result is a number in [-1, 1], where a higher number means more positive sentiment.

Returns:float – Average document sentiment
class anacode.agg.aggregation.ABSADataset(entities, normalized_texts, relations, relations_entities, evaluations, evaluations_entities)

ABSA dataset container with easy aggregation capabilities.

__init__(entities, normalized_texts, relations, relations_entities, evaluations, evaluations_entities)

Initialize instance by providing all absa data sets.

Parameters:
  • entities (pandas.DataFrame) – List of entities used in texts
  • normalized_texts (pandas.DataFrame) – List of Chinese normalized texts
  • relations (pandas.DataFrame) – List of relations with metadata
  • relations_entities (pandas.DataFrame) – List of entities used in relations
  • evaluations (pandas.DataFrame) – List of entity evaluations
  • evaluations_entities (pandas.DataFrame) – List of entities used in evaluations
best_rated_entities(n=15, entity_type='')

Find top n rated entities in this dataset sorted descending by their mean rating.

Parameters:
  • n (int) – Maximum count of returned entities
  • entity_type (str) – Optional filter for entity type to consider
Returns:

pandas.Series – Best rated entities in this dataset as index and their ratings as values sorted descending
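
For instance, comparing overall top-rated entities with the mean sentiment of specific ones (entity names are hypothetical):

    from anacode.agg import aggregation

    # Load an ABSADataset from previously written csv files (placeholder path).
    absa = aggregation.DatasetLoader.from_path('analysis-csvs').absa
    print(absa.best_rated_entities(n=10))
    print(absa.entity_sentiment(['Seats', 'Engine']))  # hypothetical entity names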

co_occurring_entities(entity, n=15, entity_type='')

Find n entities co-occurring frequently in texts of this dataset with given entity, sorted descending. Co-occurring entities can be filtered by their type.

Parameters:
  • entity (str) – Entity to inspect for co-occurring entities
  • n (int) – Maximum count of returned entities
  • entity_type (str) – Limit co-occurring entity counts only to this type of entities.
Returns:

pandas.Series – Co-occurring entity names as index and their counts as values sorted descending

entity_frequency(entity, entity_type='', normalize=False)

Returns occurrence counts for the input entity or entities. If the input is a list or tuple, the result preserves its order. Entities whose type does not start with the given entity_type or that are not in the dataset always have a zero count. Setting normalize turns absolute counts into relative percentages.

Specifying entity_type is intended to be used together with normalize. Without normalize it has no effect beyond the zeroing described above; with normalize, percentages reflect counts within the specified entity_type only, instead of the whole dataset.

Parameters:
  • entity (tuple, list, set or str) – Entity name or tuple/list/set of entity names
  • entity_type (str) – Optional filter for entity type to consider
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Entity names as index and entity frequencies as values, sorted as the input if it was a tuple or list

entity_sentiment(entity)

Computes and returns the mean rating for the given entity, or for multiple entities if a list, tuple or set is given. If the input is a list or tuple, the result Series preserves its order.

Parameters:entity (tuple, list, set or str) – Name(s) of entity(ies) to compute mean sentiment for
Returns:pandas.Series – Mean ratings for entities, np.nan if entity was not rated. Entity names are in index and their sentiments are values
entity_texts(entity)

Returns a dictionary mapping each given entity to the list of normalized texts in which it occurs.

Parameters:entity (tuple, list, set or str) – Name(s) of entities to find in normalized texts
Returns:dict – Map where keys are entity names and values are lists of normalized strings
least_common_entities(n=15, entity_type='', normalize=False)

Counts entities and returns the n least frequent ones sorted by their count ascending. Counted entities can be filtered by their type using entity_type, and returned counts can be normalized with normalize.

If both entity_type and normalize are specified, entity ratios are computed only from entity counts within the given entity_type.

Parameters:
  • n (int) – Maximum number of least frequent entities to return
  • entity_type (str) – Limit entities counts only to entities whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Entity names as index and their counts as values sorted ascending

most_common_entities(n=15, entity_type='', normalize=False)

Counts entities and returns the n most frequent ones sorted by their count descending. Counted entities can be filtered by their type using entity_type, and returned counts can be normalized with normalize.

If both entity_type and normalize are specified, entity ratios are computed only from entity counts within the given entity_type.

Parameters:
  • n (int) – Maximum number of most common entities to return
  • entity_type (str) – Limit entities counts only to entities whose type starts with this string
  • normalize (bool) – Returns relative frequencies if normalize is True
Returns:

pandas.Series – Entity names as index and their counts as values sorted descending

surface_strings(entity)

Returns a dictionary mapping each given entity to the list of surface strings that realize it.

Parameters:entity (tuple, list, set or str) – Name(s) of entities to find surface strings for
Returns:dict – Map where keys are entity names and values are lists of surface strings
worst_rated_entities(n=15, entity_type='')

Find n worst rated entities in this dataset sorted ascending by their mean rating.

Parameters:
  • n (int) – Maximum count of returned entities
  • entity_type (str) – Optional filter for entity type to consider
Returns:

pandas.Series – Worst rated entities in this dataset as index and their ratings as values sorted ascending

anacode.agg.plotting

anacode.agg.plotting.barhchart(aggregation, path=None, color='dull green', title=None)

Plots an aggregation result as a horizontal bar chart.

Parameters:
  • aggregation (pd.Series) – anacode.agg aggregation result
  • path (str) – If specified, the graph is saved to this file instead of being returned
  • color (str) – seaborn named color for bars
  • title (str) – Title of chart; set to None for automatic title, empty string for no title
Returns:

matplotlib.axes._subplots.AxesSubplot – Axes for generated plot or None if graph was saved to file
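
Aggregation results plug directly into this helper; for example, saving the ten most common concepts as a bar chart (the file name is a placeholder, and concepts is a ConceptsDataset as in earlier sketches):

    from anacode.agg import plotting

    # concepts: ConceptsDataset, e.g. DatasetLoader.from_path('analysis-csvs').concepts
    top_concepts = concepts.most_common_concepts(n=10)
    plotting.barhchart(top_concepts, path='top_concepts.png')  # saves instead of returning axes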

anacode.agg.plotting.piechart(aggregation, path=None, colors=None, category_count=6, explode=0, edgesize=0, edgecolor='#333333', perc_color='black', labeldistance=1.1)

Plots piechart with categories.

Parameters:
  • aggregation (pd.Series) – Aggregation library result
  • path (str) – If specified, the graph is saved to this file instead of being returned
  • colors – Will be passed to matplotlib.pyplot.pie as colors
  • category_count (int) – How many categories to include in the pie chart
  • explode (float) – Size of whitespace between pie pieces
  • edgesize (int) – Pie’s edge size; set to 0 for no edge
  • edgecolor (matplotlib supported color) – Color of pie’s edge
  • perc_color (matplotlib supported color) – Color of percentages drawn inside the pie chart
  • labeldistance (float) – Radial distance at which the pie labels are drawn, passed to matplotlib.pyplot.pie
Returns:

matplotlib.axes._subplots.AxesSubplot – Axes for generated plot or None if graph was saved to file

anacode.agg.plotting.concept_cloud(aggregation, path=None, size=(600, 400), background='white', colormap_name='Accent', max_concepts=200, stopwords=None, font=None)

Generates a concept cloud image from aggregated concept frequencies and stores it to path. If path is None, returns the image as an np.ndarray instead. One way to view the resulting image is matplotlib’s imshow method.

Parameters:
  • aggregation (pandas.Series) – Concept frequency aggregation to plot; concept names as index and their counts as values
  • path (str) – Save plot to this file. Set to None if you want raw image np.ndarray of this plot as a return value
  • size (tuple) – Size of plot in pixels as tuple (width: int, height: int)
  • background (matplotlib color definition) – Name of background color
  • colormap_name (str) – Name of matplotlib colormap that will be used to sample random colors for concepts in plot
  • max_concepts (int) – Maximum number of concepts that will be plotted
  • stopwords (iter) – Optionally set stopwords to use for the plot
  • font (str) – Path to font that will be used
Returns:

np.ndarray – Raw image of the concept cloud if path is None, otherwise None (the image is saved to file)
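
A closing sketch that renders the cloud inline rather than saving it, using the np.ndarray return value with matplotlib (concepts is a ConceptsDataset as in earlier sketches):

    import matplotlib.pyplot as plt

    from anacode.agg import plotting

    # concepts: ConceptsDataset, e.g. DatasetLoader.from_path('analysis-csvs').concepts
    freqs = concepts.concept_frequencies(max_concepts=200)
    image = plotting.concept_cloud(freqs, path=None, size=(600, 400))
    plt.imshow(image)   # display the raw image array
    plt.axis('off')
    plt.show()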