Library Docstrings

The Anacode Toolkit library consists of two modules, anacode.api and anacode.agg. anacode.api simplifies the use of the API, whereas anacode.agg provides functionality for further analysis, aggregation and visualization of the results.
anacode.api

Writers

class anacode.api.writers.Writer
   Base "abstract" class containing common methods that are needed by all implementations of the Writer interface. The writer interface consists of init, close and write_bulk methods.

   close()
      Not implemented here! Each subclass should decide what to do here.

   init()
      Not implemented here! Each subclass should decide what to do here.

   write_absa(analyzed, single_document=False)
      Converts absa analysis result to flat lists and stores them.
      Parameters:
      - analyzed (list) – JSON absa analysis result
      - single_document (bool) – Is analysis describing just one document

   write_analysis(analyzed)
      Inspects analysis result for performed analyses and delegates persisting of results to the appropriate write methods.
      Parameters:
      - analyzed (dict) – JSON object analysis response

   write_bulk(results)
      Stores multiple Anacode API JSON responses marked with call IDs as (call_id, call_result) tuples. Both scrape and analyze call IDs are defined in the anacode.codes module.
      Parameters:
      - results (list) – List of Anacode API responses with IDs of the calls used

   write_categories(analyzed, single_document=False)
      Converts categories analysis result to flat lists and stores them.
      Parameters:
      - analyzed (list) – JSON categories analysis result
      - single_document (bool) – Is analysis describing just one document

   write_concepts(analyzed, single_document=False)
      Converts concepts analysis result to flat lists and stores them.
      Parameters:
      - analyzed (list) – JSON concepts analysis result
      - single_document (bool) – Is analysis describing just one document
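The interface above can be sketched with a minimal in-memory implementation. ListWriter below is a hypothetical example for illustration, not part of the package; it only assumes the documented contract of init, close and write_bulk over (call_id, call_result) tuples.

```python
# Minimal sketch of the Writer interface: init, close and write_bulk.
# ListWriter is a hypothetical example, not shipped with the package.

class ListWriter:
    """Collects (call_id, call_result) tuples in a plain list."""

    def init(self):
        # prepare storage before any writes happen
        self.results = []

    def close(self):
        # nothing to release for an in-memory store
        pass

    def write_bulk(self, results):
        # results is an iterable of (call_id, call_result) tuples,
        # as produced by the bulk analyzer
        self.results.extend(results)

writer = ListWriter()
writer.init()
writer.write_bulk([("analyze", {"concepts": []}), ("scrape", ["row"])])
writer.close()
```

A subclass that writes to disk would open its files in init and flush/close them in close, which is why the base class leaves both unimplemented.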
class anacode.api.writers.CSVWriter(target_dir='.')
   Writes Anacode API output into csv files in target_dir.

class anacode.api.writers.DataFrameWriter(frames=None)
   Writes Anacode API output into pandas.DataFrame instances.
Querying

class anacode.api.client.AnacodeClient(auth, base_url='https://api.anacode.de/')
   Makes posting data to the server for analysis simpler by storing the user's auth, the URL of the Anacode API server and paths for analysis calls.
   To find out more about specific API calls and analyses and their output format, please refer to https://api.anacode.de/api-docs/calls.html.

   __init__(auth, base_url='https://api.anacode.de/')
      The default value for base_url is taken from the environment variable ANACODE_API_URL if set; otherwise, 'https://api.anacode.de/' is used.
      Parameters:
      - auth (str) – User's token string
      - base_url (str) – Anacode API server URL
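The environment-variable fallback for base_url presumably behaves like the following sketch (resolve_base_url is a hypothetical helper name, not the library's actual code):

```python
import os

DEFAULT_BASE_URL = 'https://api.anacode.de/'

def resolve_base_url(explicit=None):
    # An explicit argument wins; otherwise the ANACODE_API_URL
    # environment variable; otherwise the documented default.
    if explicit is not None:
        return explicit
    return os.environ.get('ANACODE_API_URL', DEFAULT_BASE_URL)

os.environ.pop('ANACODE_API_URL', None)
print(resolve_base_url())           # https://api.anacode.de/
os.environ['ANACODE_API_URL'] = 'https://example.test/'
print(resolve_base_url())           # https://example.test/
```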
   analyze(texts, analyses, external_entity_data=None, single_document=False)
      Use the Anacode API to perform the specified linguistic analyses on texts. Please consult https://api.anacode.de/api-docs/calls.html for more details and a better understanding of the parameters.
      Parameters:
      - texts (list) – List of texts to analyze
      - analyses (list) – List of analyses to perform. Can contain 'categories', 'concepts', 'sentiment' and 'absa'
      - external_entity_data – Additional entities to relate to sentiment evaluation
      - single_document (bool) – Makes the API treat texts as paragraphs of one document instead of treating them as separate documents
      Returns: dict – JSON analysis response
class anacode.api.client.Analyzer(client, writer, threads=1, bulk_size=100)
   This class makes it simple to query with multiple threads and to store results in formats other than a list of JSON objects.

   __init__(client, writer, threads=1, bulk_size=100)
      Parameters:
      - client (anacode.api.client.AnacodeClient) – Will be used to post analyses to the Anacode API
      - writer (anacode.api.writers.Writer) – Needs to implement the init, close and write_bulk methods from the Writer interface
      - threads (int) – Number of concurrent threads to use, defaults to 1
      - bulk_size (int) – How often the writer's write_bulk method should be invoked, defaults to 100
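The interplay of the task queue, bulk_size and the writer can be illustrated with a stripped-down sketch. BulkAnalyzer and ListWriter are hypothetical stand-ins; the real Analyzer posts queued tasks to the Anacode API (via multiprocessing.dummy.Pool when threads > 1) instead of fabricating results locally.

```python
class BulkAnalyzer:
    """Caches analysis tasks and flushes them to a writer in bulks."""

    def __init__(self, writer, bulk_size=100):
        self.writer = writer
        self.bulk_size = bulk_size
        self._queue = []     # pending (call_id, payload) tuples
        self._cache = []     # analysis results not yet written

    def should_start_analysis(self):
        # True once a full bulk has accumulated
        return len(self._queue) >= self.bulk_size

    def analyze(self, texts):
        self._queue.append(('analyze', texts))
        if self.should_start_analysis():
            self.analyze_bulk()

    def analyze_bulk(self):
        # The real implementation posts each queued task to the API;
        # here we fake a result to keep the sketch self-contained.
        self._cache.extend(
            (call_id, {'ok': payload}) for call_id, payload in self._queue)
        self._queue = []

    def flush_analysis_data(self):
        # Hand all cached results to the writer in one bulk
        self.writer.write_bulk(self._cache)
        self._cache = []

class ListWriter:
    def __init__(self):
        self.rows = []
    def write_bulk(self, results):
        self.rows.extend(results)

writer = ListWriter()
analyzer = BulkAnalyzer(writer, bulk_size=2)
for text in (['a'], ['b'], ['c']):
    analyzer.analyze(text)
analyzer.analyze_bulk()          # flush the incomplete last bulk
analyzer.flush_analysis_data()
print(len(writer.rows))          # 3
```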
   analyze(texts, analyses, external_entity_data=None, single_document=False)
      Dummy clone for anacode.api.client.AnacodeClient.analyze()

   analyze_bulk()
      Performs bulk analysis. Will use multiprocessing.dummy.Pool to post data to the Anacode API if the number of threads is more than one. Analysis results are not returned, but cached internally.

   flush_analysis_data()
      Writes all cached analysis results using the writer.

   scrape(link)
      Dummy clone for anacode.api.client.AnacodeClient.scrape()

   should_start_analysis()
      Checks how many tasks are in the queue and returns a boolean indicating whether analysis should be performed.
      Returns: bool – True if analysis should happen now, False otherwise

anacode.api.client.analyzer(auth, writer, threads=1, bulk_size=100, base_url='https://api.anacode.de/')
   Convenience function for initializing a bulk analyzer, and potentially a temporary writer instance as well.
   Parameters:
   - auth (str) – User's token string
   - writer (anacode.api.writers.Writer, str or dict) – Writer instance that will store analysis results, a path to a folder where csv files should be saved, or a dictionary where data frames should be stored
   - threads (int) – Number of threads to use for HTTPS communication with the server
   - bulk_size (int) – How often the writer's write_bulk method should be invoked
   - base_url (str) – Anacode API server URL
   Returns: anacode.api.client.Analyzer – Bulk analyzer instance
anacode.agg

Dataset loader

class anacode.agg.aggregation.DatasetLoader(concepts=None, concepts_surface_strings=None, categories=None, sentiments=None, absa_entities=None, absa_normalized_texts=None, absa_relations=None, absa_relations_entities=None, absa_evaluations=None, absa_evaluations_entities=None)
   Loads analysed data obtained via the Anacode API from various formats.

   __getitem__(item)
      If item is the name of a linguistic dataset known to DatasetLoader, returns the corresponding dataset, or None if that dataset was not loaded. If item is not a recognized name, a KeyError is thrown.
      Parameters:
      - item (str) – possible values: categories, concepts, concepts_surface_strings, sentiments, absa_entities, absa_normalized_texts, absa_relations, absa_relations_entities, absa_evaluations, absa_evaluations_entities
      Returns: pandas.DataFrame – DataFrame with the requested data if found, else None
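The lookup semantics described above (None for a known-but-unloaded dataset, KeyError for an unknown name) can be sketched as follows; LoaderSketch is a hypothetical stand-in, not the real class.

```python
# Dataset names documented for DatasetLoader.__getitem__
KNOWN_DATASETS = (
    'categories', 'concepts', 'concepts_surface_strings', 'sentiments',
    'absa_entities', 'absa_normalized_texts', 'absa_relations',
    'absa_relations_entities', 'absa_evaluations', 'absa_evaluations_entities',
)

class LoaderSketch:
    """Hypothetical stand-in for DatasetLoader's item lookup."""

    def __init__(self, **datasets):
        # Every known name gets a slot; missing ones stay None
        self._data = {name: datasets.get(name) for name in KNOWN_DATASETS}

    def __getitem__(self, item):
        if item not in self._data:
            raise KeyError(item)   # unrecognized dataset name
        return self._data[item]    # DataFrame, or None if not loaded

loader = LoaderSketch(concepts='fake-frame')
print(loader['concepts'])    # fake-frame
print(loader['sentiments'])  # None – known name, no data loaded
```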
   __init__(concepts=None, concepts_surface_strings=None, categories=None, sentiments=None, absa_entities=None, absa_normalized_texts=None, absa_relations=None, absa_relations_entities=None, absa_evaluations=None, absa_evaluations_entities=None)
      Constructs a DatasetLoader instance that is aware of what data is available to it. Raises ValueError if no data was provided.
      Data frames are expected to have the format that anacode.api.writers.Writer would write.
      Parameters:
      - concepts (pandas.DataFrame) – List of found concepts with metadata
      - concepts_surface_strings (pandas.DataFrame) – List of strings realizing concepts
      - categories (pandas.DataFrame) – List of document category probabilities
      - sentiments (pandas.DataFrame) – List of document sentiment polarities
      - absa_entities (pandas.DataFrame) – List of absa entities used in texts
      - absa_normalized_texts (pandas.DataFrame) – List of Chinese normalized strings identified and analyzed by absa
      - absa_relations (pandas.DataFrame) – List of absa relations with metadata
      - absa_relations_entities (pandas.DataFrame) – List of absa entities used in relations
      - absa_evaluations (pandas.DataFrame) – List of absa evaluations
      - absa_evaluations_entities (pandas.DataFrame) – List of absa entities used in evaluations
   absa
      Creates a new ABSADataset if data is available.
      Returns: anacode.agg.aggregations.ABSADataset

   categories
      Creates a new CategoriesDataset if data is available.
      Returns: anacode.agg.aggregations.CategoriesDataset

   concepts
      Creates a new ConceptsDataset if data is available.
      Returns: anacode.agg.aggregations.ConceptsDataset

   filter(document_ids)
      Creates a new DatasetLoader instance using data only from documents with ids in document_ids.
      Parameters:
      - document_ids (iterable) – Iterable with document ids. Cannot be empty.
      Returns: DatasetLoader – New DatasetLoader instance with data only from the desired documents
   classmethod from_api_result(result)
      Initializes a DatasetLoader from API JSON output. Works both with a single analysis result and with a list of analysis results.
      Parameters:
      - result – Either a single API JSON analysis dict or a list of them
      Returns: anacode.agg.DatasetLoader – DatasetLoader with the available analysis data loaded

   classmethod from_path(path, backup_suffix='')
      Initializes a DatasetLoader from the AnacodeAPI csv files present in the given path. You could have obtained these by using anacode.api.writers.CSVWriter to write your request results when you were querying AnacodeAPI.
      Returns: anacode.agg.DatasetLoader – DatasetLoader with the found csv files loaded into data frames
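Loading the csv files that CSVWriter produced could look roughly like this sketch. It uses the stdlib csv module and a hypothetical load_csv_datasets helper; the real loader builds pandas DataFrames and knows the exact file names to expect.

```python
import csv
import tempfile
from pathlib import Path

def load_csv_datasets(path):
    """Read every *.csv file in path into a dict of row-dict lists."""
    datasets = {}
    for csv_file in Path(path).glob('*.csv'):
        with open(csv_file, newline='', encoding='utf-8') as fp:
            # file stem (e.g. 'concepts') becomes the dataset name
            datasets[csv_file.stem] = list(csv.DictReader(fp))
    return datasets

# Demo with a throwaway directory standing in for CSVWriter output
with tempfile.TemporaryDirectory() as target_dir:
    Path(target_dir, 'concepts.csv').write_text(
        'doc_id,concept\n0,Lithium\n', encoding='utf-8')
    data = load_csv_datasets(target_dir)
    print(data['concepts'][0]['concept'])  # Lithium
```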
   classmethod from_writer(writer)
      Initializes a DatasetLoader from a writer instance that was used to store anacode analysis. Accepts both anacode.api.writers.DataFrameWriter and anacode.api.writers.CSVWriter.
      Parameters:
      - writer (anacode.api.writers.Writer) – Writer that was used by anacode.api.client.Analyzer to store the analysis
      Returns: anacode.agg.DatasetLoader – DatasetLoader with the available data frames loaded

   remove_concepts(concepts)
      Removes the given concepts from the dataset if they are present.
      Parameters:
      - concepts (iterable) – These concepts will be removed from the dataset

   sentiments
      Creates a new SentimentDataset if data is available.
      Returns: anacode.agg.aggregations.SentimentDataset
API Datasets

class anacode.agg.aggregation.ApiCallDataset
   Base class for specific call data sets.

class anacode.agg.aggregation.CategoriesDataset(categories)
   Categories dataset container with easy aggregation capabilities.

   __init__(categories)
      Initialize the instance by providing the categories data set.
      Parameters:
      - categories (pandas.DataFrame) – List of document category probabilities

   categories()
      Aggregates categories across the whole dataset.
      Returns: pandas.Series

   main_category()
      Finds the main category of the dataset.
      Returns: str – Name of the main category
class anacode.agg.aggregation.ConceptsDataset(concepts, surface_strings)
   Concepts dataset container with easy aggregation capabilities.

   __init__(concepts, surface_strings)
      Initialize the instance by providing the two dataframes required for concepts representation.
      Parameters:
      - concepts (pandas.DataFrame) – List of found concepts with metadata
      - surface_strings (pandas.DataFrame) – List of strings realizing found concepts

   co_occurring_concepts(concept, n=15, concept_type='')
      Finds the n concepts co-occurring most frequently with the given concept in the texts of this dataset, sorted by descending frequency. Co-occurring concepts can be filtered by their type.
      Returns: pandas.Series – Co-occurring concept names as index and their frequencies as values, sorted by descending frequency
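Co-occurrence counting boils down to tallying, per document, which other concepts appear alongside the target. A minimal sketch with collections.Counter, assuming a hypothetical doc-id-to-concept-set mapping as input:

```python
from collections import Counter

def co_occurring_concepts(doc_concepts, concept, n=15):
    """doc_concepts maps document id -> set of concept names."""
    counter = Counter()
    for concepts in doc_concepts.values():
        if concept in concepts:
            # count every other concept in the same document
            counter.update(c for c in concepts if c != concept)
    return counter.most_common(n)

docs = {
    0: {'BMW', 'Engine', 'Price'},
    1: {'BMW', 'Engine'},
    2: {'Audi', 'Price'},
}
print(co_occurring_concepts(docs, 'BMW'))  # [('Engine', 2), ('Price', 1)]
```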
   concept_frequencies(max_concepts=200, concept_type='', concept_filter=None)
      Returns a pandas Series with counts for all concepts in the dataset.
      To filter the concepts included in the result you can use concept_type and concept_filter. The former restricts the result to concepts of a specific type; the latter is a callable that takes a concept name and returns a bool indicating whether the concept should pass the filter. You can set both at the same time: concept_type is applied first, concept_filter second.
      Parameters:
      - max_concepts (int) – Maximum number of concepts in the result
      - concept_type (str) – Limit concepts only to concepts whose type starts with this string
      - concept_filter (callable) – If not None, the callable needs to accept one string parameter (a concept name) and return True if the concept should pass the filter, False otherwise. Only concepts that pass are included in the result.
      Returns: pandas.Series – Concept names as index and their counts as values

   concept_frequency(concept, concept_type='', normalize=False)
      Returns the occurrence count of the input concept or concept list. The resulting list has concepts sorted just as they were in the input, if it was a list or tuple. Concepts that are not of concept_type or that are not in the dataset will always have a zero count. Setting normalize will turn absolute counts into relative percentages.
      Specifying concept_type is intended to be used only with normalization. If used without normalize set, it has no effect except that concepts which do not have the given type will have a zero count. When used with normalize set, percentages will reflect counts only within the specified concept_type instead of the whole dataset.
      Returns: pandas.Series – Concept names as index and their counts as values, sorted as they were in the input
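The count/normalize behavior can be sketched in plain Python (hypothetical helper; the type-filtering described above is omitted here for brevity, and the input is a flat list of observed (concept, type) pairs rather than a DataFrame):

```python
from collections import Counter

def concept_frequency(all_concepts, concepts, normalize=False):
    """all_concepts: iterable of (concept, concept_type) pairs observed
    in the dataset; concepts: a name or list of names to look up."""
    if isinstance(concepts, str):
        concepts = [concepts]
    counts = Counter(name for name, _type in all_concepts)
    total = sum(counts.values())
    result = []
    for name in concepts:             # preserve input order
        count = counts.get(name, 0)   # zero for unknown concepts
        result.append(count / total if normalize else count)
    return result

observed = [('BMW', 'brand'), ('BMW', 'brand'),
            ('Engine', 'feature'), ('Audi', 'brand')]
print(concept_frequency(observed, ['BMW', 'Tesla']))                  # [2, 0]
print(concept_frequency(observed, ['BMW', 'Tesla'], normalize=True))  # [0.5, 0.0]
```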
   least_common_concepts(n=15, concept_type='', normalize=False)
      Counts concepts and returns the n least frequent ones, sorted by their count ascending. Counted concepts can be filtered by their type using concept_type, and the returned counts can be normalized with normalize.
      If both concept_type and normalize are specified, concept ratios are computed only from concept counts within the given concept_type.
      Returns: pandas.Series – Concept names as index and their counts as values, sorted ascending

   make_idf_filter(threshold, concept_type='')
      Generates a concept filter based on the idf values of concepts in the represented documents. This filter can be used directly as a parameter for a concept_cloud call.
      Returns: callable – Function that can be used as idf_func in concept_cloud
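An idf-based filter like the one returned here could be built as a closure over document frequencies. The sketch below assumes the standard idf definition log(N / df) and a list of per-document concept sets as input; both the helper name and the treatment of unseen concepts are assumptions, not the library's actual code.

```python
import math

def make_idf_filter(doc_concepts, threshold):
    """doc_concepts: list of per-document concept sets."""
    n_docs = len(doc_concepts)
    df = {}  # document frequency per concept
    for concepts in doc_concepts:
        for concept in concepts:
            df[concept] = df.get(concept, 0) + 1

    def idf_filter(concept):
        # Unseen concepts are maximally informative -> pass the filter
        if concept not in df:
            return True
        return math.log(n_docs / df[concept]) >= threshold

    return idf_filter

docs = [{'BMW', 'Engine'}, {'BMW', 'Price'}, {'BMW'}]
keep = make_idf_filter(docs, threshold=0.5)
print(keep('BMW'))    # False – appears everywhere, idf = 0
print(keep('Price'))  # True  – idf = log(3), about 1.1
```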
   make_time_series(concepts, date_info, delta, interval=None)
      Creates a DataFrame with counts for each concept in every delta time tick that exists in interval. If you do not specify interval, it is computed from date_info to include all documents.
      The concepts dataset contains no information about document release dates, so you have to provide this information externally as date_info. It needs to be a mapping that has all document ids from the concepts dataset as keys, each referring to a datetime.date representing the release date of the document.
      The result includes zero counts for ticks in which a concept was not mentioned. Each row also contains the start and stop times for that particular count; counts from the stop time are not included in the tick.
      Returns: pandas.DataFrame – DataFrame with columns "Concept", "Count", "Start" and "Stop"
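The tick-binning logic can be sketched with datetime arithmetic; this hypothetical version handles a single concept and returns plain row dicts instead of a DataFrame, but follows the documented semantics (half-open [Start, Stop) ticks, stop time excluded).

```python
from datetime import date, timedelta

def make_time_series(doc_concepts, date_info, delta, concept):
    """Count mentions of concept per time tick of width delta.

    doc_concepts: doc_id -> list of concepts in that document
    date_info:    doc_id -> datetime.date of the document
    """
    tick = min(date_info.values())
    last = max(date_info.values())
    rows = []
    while tick <= last:
        count = sum(
            doc_concepts[doc_id].count(concept)
            for doc_id, released in date_info.items()
            if tick <= released < tick + delta   # stop time excluded
        )
        rows.append({'Concept': concept, 'Count': count,
                     'Start': tick, 'Stop': tick + delta})
        tick += delta
    return rows

docs = {0: ['BMW', 'BMW'], 1: ['BMW'], 2: ['Audi']}
dates = {0: date(2017, 1, 1), 1: date(2017, 1, 9), 2: date(2017, 1, 2)}
series = make_time_series(docs, dates, timedelta(days=7), 'BMW')
print([row['Count'] for row in series])  # [2, 1]
```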
   most_common_concepts(n=15, concept_type='', normalize=False)
      Counts concepts and returns the n most frequent ones, sorted by their count descending. Counted concepts can be filtered by their type using concept_type, and the returned counts can be normalized with normalize.
      If both concept_type and normalize are specified, concept ratios are computed only from concept counts within the given concept_type.
      Returns: pandas.Series – Concept names as index and their counts as values, sorted descending

   nltk_textcollection(concept_type='')
      Wraps the concepts of each represented document into nltk.text.Text and returns these wrapped in an nltk.text.TextCollection.
      Parameters:
      - concept_type (str) – Limit gathered concepts to this type of concepts
      Returns: nltk.text.TextCollection – TextCollection of the represented documents

   surface_forms(concept, n=15)
      Finds n random surface strings from the analyzed texts that were identified as concept.
      Parameters:
      - concept – Inspect this concept's surface forms
      - n – Maximum number of unique surface forms returned
      Returns: set – Set with a maximum of n surface forms of concept
class anacode.agg.aggregation.SentimentDataset(sentiments)
   Sentiment dataset container with easy aggregation capabilities.

   __init__(sentiments)
      Initialize the instance by providing the sentiments data set.
      Parameters:
      - sentiments (pandas.DataFrame) – List of document sentiment polarities

   average_sentiment()
      Computes and returns the average document sentiment. The result is a number from [-1, 1], where a higher number means more positive sentiment.
      Returns: float – Average document sentiment
class anacode.agg.aggregation.ABSADataset(entities, normalized_texts, relations, relations_entities, evaluations, evaluations_entities)
   ABSA dataset container that provides easy aggregation capabilities.

   __init__(entities, normalized_texts, relations, relations_entities, evaluations, evaluations_entities)
      Initialize the instance by providing all absa data sets.
      Parameters:
      - entities (pandas.DataFrame) – List of entities used in texts
      - normalized_texts (pandas.DataFrame) – List of Chinese normalized texts
      - relations (pandas.DataFrame) – List of relations with metadata
      - relations_entities (pandas.DataFrame) – List of entities used in relations
      - evaluations (pandas.DataFrame) – List of entity evaluations
      - evaluations_entities (pandas.DataFrame) – List of entities used in evaluations

   best_rated_entities(n=15, entity_type='')
      Finds the top n rated entities in this dataset, sorted descending by their mean rating.
      Returns: pandas.Series – Best rated entities in this dataset as index and their ratings as values, sorted descending
   co_occurring_entities(entity, n=15, entity_type='')
      Finds the n entities co-occurring most frequently with the given entity in the texts of this dataset, sorted descending. Co-occurring entities can be filtered by their type.
      Returns: pandas.Series – Co-occurring entity names as index and their counts as values, sorted descending

   entity_frequency(entity, entity_type='', normalize=False)
      Returns the occurrence count of the input entity or entity list. The resulting list has entities sorted just as they were in the input, if it was a list or tuple. Entities whose entity_type does not start with the given one, or that are not in the dataset, will always have a zero count. Setting normalize will turn absolute counts into relative percentages.
      Specifying entity_type is intended to be used only with normalization. If used without normalize set, it has no effect except for the zeroing mentioned above. When used with normalize set, the resulting percentages will reflect counts only within the specified entity_type instead of the whole dataset.
      Returns: pandas.Series – Entity names as index and entity frequencies as values, sorted as the input if it was a tuple or list

   entity_sentiment(entity)
      Computes and returns the mean rating for the given entity, or entities if a list, tuple or set is given. If the input is a list or tuple, the result Series is sorted as the input was.
      Parameters:
      - entity (tuple, list, set or str) – Name(s) of the entity(ies) to compute the mean sentiment for
      Returns: pandas.Series – Mean ratings for entities, np.nan if an entity was not rated. Entity names are the index and their sentiments the values
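Mean per-entity ratings with NaN for unrated entities could be computed like this sketch (hypothetical helper over plain (entity, rating) pairs; the real method works on the evaluations DataFrame and returns a pandas Series):

```python
import math

def entity_sentiment(ratings, entities):
    """ratings: list of (entity, rating) pairs; entities: name or list."""
    if isinstance(entities, str):
        entities = [entities]
    by_entity = {}
    for entity, rating in ratings:
        by_entity.setdefault(entity, []).append(rating)
    result = {}
    for entity in entities:              # preserve input order
        values = by_entity.get(entity)
        # NaN for entities that were never rated
        result[entity] = sum(values) / len(values) if values else math.nan
    return result

ratings = [('Engine', 2.0), ('Engine', 4.0), ('Price', -1.5)]
means = entity_sentiment(ratings, ['Engine', 'Seats'])
print(means['Engine'])             # 3.0
print(math.isnan(means['Seats']))  # True
```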
   entity_texts(entity)
      Returns the normalized texts in which each entity can be found, as a dictionary.
      Parameters:
      - entity (tuple, list, set or str) – Name(s) of the entities to find in normalized texts
      Returns: dict – Map where keys are entity names and values are lists of normalized strings

   least_common_entities(n=15, entity_type='', normalize=False)
      Counts entities and returns the n least frequent ones, sorted by their count ascending. Counted entities can be filtered by their type using entity_type, and the returned counts can be normalized with normalize.
      If both entity_type and normalize are specified, entity ratios are computed only from entity counts within the given entity_type.
      Returns: pandas.Series – Entity names as index and their counts as values, sorted ascending

   most_common_entities(n=15, entity_type='', normalize=False)
      Counts entities and returns the n most frequent ones, sorted by their count descending. Counted entities can be filtered by their type using entity_type, and the returned counts can be normalized with normalize.
      If both entity_type and normalize are specified, entity ratios are computed only from entity counts within the given entity_type.
      Returns: pandas.Series – Entity names as index and their counts as values, sorted descending

   surface_strings(entity)
      Returns the surface strings for each entity specified in entity, as a dictionary.
      Parameters:
      - entity (tuple, list, set or str) – Name(s) of the entities to find
      Returns: dict – Map where keys are entity names and values are lists of surface strings

   worst_rated_entities(n=15, entity_type='')
      Finds the n worst rated entities in this dataset, sorted ascending by their mean rating.
      Returns: pandas.Series – Worst rated entities in this dataset as index and their ratings as values, sorted ascending
anacode.agg.plotting

anacode.agg.plotting.barhchart(aggregation, path=None, color='dull green', title=None)
   Plots the result of an aggregation as a horizontal bar chart.
   Returns: matplotlib.axes._subplots.AxesSubplot – Axes for the generated plot, or None if the graph was saved to a file

anacode.agg.plotting.piechart(aggregation, path=None, colors=None, category_count=6, explode=0, edgesize=0, edgecolor='#333333', perc_color='black', labeldistance=1.1)
   Plots a piechart with categories.
   Parameters:
   - aggregation (pd.Series) – Aggregation library result
   - path (str) – If specified, the graph will be saved to this file instead of being returned as a result
   - colors – This will be passed to matplotlib's pie plotting as colors
   - category_count (int) – How many categories to include in the piechart
   - explode (float) – Size of the whitespace between pie pieces
   - edgesize (int) – Pie's edge size; set to 0 for no edge
   - edgecolor (matplotlib supported color) – Color of the pie's edge
   - perc_color (matplotlib supported color) – Color of the percentages drawn inside the piechart
   - labeldistance (float) – Radial distance at which the pie labels are drawn
   Returns: matplotlib.axes._subplots.AxesSubplot – Axes for the generated plot, or None if the graph was saved to a file

anacode.agg.plotting.concept_cloud(aggregation, path=None, size=(600, 400), background='white', colormap_name='Accent', max_concepts=200, stopwords=None, font=None)
   Generates a concept cloud image from frequencies and stores it to path. If path is None, returns the image as an np.ndarray instead. One way to view the resulting image is to use matplotlib's imshow method.
   Parameters:
   - aggregation (list) – List of (concept: str, frequency: int) pairs to plot
   - path (str) – Save the plot to this file. Set to None if you want the raw image np.ndarray of this plot as a return value
   - size (tuple) – Size of the plot in pixels as a tuple (width: int, height: int)
   - background (matplotlib color definition) – Name of the background color
   - colormap_name (str) – Name of the matplotlib colormap that will be used to sample random colors for concepts in the plot
   - max_concepts (int) – Maximum number of concepts that will be plotted
   - stopwords (iterable) – Optional stopwords to exclude from the plot
   - font (str) – Path to the font that will be used
   Returns: np.ndarray or None – Raw image of the generated plot, or None if the graph was saved to a file