.. _intro:

Anacode Toolkit
###############

This library is a helper tool for users of the `Anacode Web&Text API`_, a REST
API for Chinese web data collection and Natural Language Processing. The
following operations are possible with the library:

1. Abstraction of the HTTP protocol used by the Anacode Web&Text API,
   including simple concurrent querying.
2. Conversion of JSON analysis results into flat table structures.
3. Common aggregation and selection tasks on API analysis results, such as
   finding the most discussed concepts or the ten best-rated entities.
4. Convenient plotting functions for aggregated results, ready to use in
   print documents.

The first two features are covered by the module :mod:`anacode.api`; features
3 and 4 are covered by :mod:`anacode.agg`.

.. contents::
    :local:

Installation
************

The library is published via PyPI and works on Python 2.7 and Python 3.3+. To
install from PyPI, simply use pip:

.. code-block:: shell

    pip install anacode

You can also clone its repository and install from source using the
``setup.py`` script:

.. code-block:: shell

    git clone https://github.com/anacode/anacode-toolkit.git
    cd anacode-toolkit
    python setup.py install

Using Anacode API and saving results (anacode.api)
**************************************************

Querying the API
================

The :mod:`anacode.api` module provides functionality for HTTP communication
with the Anacode Web&Text API. The class
:class:`anacode.api.client.AnacodeClient` can be used to analyze Chinese
texts.

.. code-block:: python

    >>> from anacode.api import client
    >>> # base_url is optional
    >>> api = client.AnacodeClient(
    >>>     'token', base_url='https://api.anacode.de/')
    >>> # this will create an HTTP request for you, send it to the appropriate
    >>> # endpoint, parse the result and return it as a python dict
    >>> json_analysis = api.analyze(['储物空间少', '后备箱空间不小'], ['concepts'])

There is also a class :class:`anacode.api.client.Analyzer` to perform bulk
querying.
It can use multiple threads and saves the results either to pandas dataframes
or to csv files. However, it is not intended for direct usage - instead,
please use the interface to it that is covered in :ref:`using-analyzer`.

Saving results
==============

Since there is no analysis tool that can analyze arbitrary JSON schemas well,
the toolkit offers a simple way to convert lists of API JSON results to a
standard SQL-like data structure. There are two possibilities: you can
convert your output to `pandas.DataFrame`_ objects, or store it to disk in
csv files, making it ready to be input into various data processing programs
such as Excel.

The JSON-to-CSV conversion code lives in :mod:`anacode.api.writers`. You are
not expected to use it directly, but here is a quick example of how to load
sentiment analysis results into memory as a dataframe.

.. code-block:: python

    >>> from anacode.api import writers
    >>> sentiment_json_output_0 = [
    >>>     {"sentiment_value": 0.7},
    >>>     {"sentiment_value": -0.1},
    >>> ]
    >>> sentiment_json_output_1 = [
    >>>     {"sentiment_value": 0.34},
    >>> ]
    >>> df_writer = writers.DataFrameWriter()
    >>> df_writer.init()
    >>> df_writer.write_sentiment(sentiment_json_output_0)
    >>> df_writer.write_sentiment(sentiment_json_output_1)
    >>> df_writer.close()
    >>> df_writer.frames['sentiments']

.. parsed-literal::

       doc_id  text_order  sentiment_value
    0       0           0             0.70
    1       0           1            -0.10
    2       1           0             0.34

The schemas of the tables are described in :ref:`analysed-schema`.

Both :class:`anacode.api.writers.DataFrameWriter` and
:class:`anacode.api.writers.CSVWriter` have the same interface. They generate
document ids (doc_id) incrementally and separately for ``analyze`` and
``scrape``. That means that the document id gets incremented each time you
successfully receive an analysis/scrape result from the API.

.. _using-analyzer:

Using analyzer
==============

If you want to analyze a larger number of texts and store the analysis
results to a csv file, you can use the
:func:`anacode.api.client.analyzer` function.
It provides an easy interface for bulk querying and for storing results in a
table-like data structure. The following code snippet analyzes categories and
sentiment for all `documents` in a single thread, in bulks of size 100, and
saves the resulting csv files to the folder ``ling/``.

.. code-block:: python

    >>> from anacode.api import client
    >>> documents = [
    >>>     ['Chinese text 1', 'Chinese text 2'],
    >>>     ['...'],
    >>> ]
    >>> with client.analyzer('token', 'ling') as api:
    >>>     for document in documents:
    >>>         api.analyze(document, ['categories', 'sentiment'])

By contrast, the code snippet below analyzes categories and sentiment for all
`documents` in two threads, in bulks of size 200, and saves the output as
pandas DataFrames to the provided dictionary.

.. code-block:: python

    >>> from anacode.api.client import analyzer
    >>> documents = [
    >>>     ['Chinese text 1', 'Chinese text 2'],
    >>>     ['...'],
    >>> ]
    >>> output_dict = {}
    >>> with analyzer('token', output_dict, threads=2, bulk_size=200) as api:
    >>>     for document in documents:
    >>>         api.analyze(document, ['categories', 'sentiment'])
    >>> print(output_dict.keys())

.. parsed-literal::

    dict_keys(['categories', 'sentiments'])

Single document mode
====================

The Anacode API supports sending a list of texts in two different modes. By
default, each text in the list is considered a separate document. This means
that categories and sentiment analysis are applied to each text in the list
separately. You can make the Anacode API treat the texts as parts of one
document by setting the *single_document* switch of
:func:`analyze <anacode.api.client.AnacodeClient.analyze>`. The server will
then perform just one categories analysis and just one sentiment analysis on
all the texts together. You can read about this behavior in the API
documentation. On the Anacode Toolkit side, setting *single_document* mode
means that *text_order* is used to mark the different paragraphs of the
larger document in the csv and DataFrame output of the analysis.
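Putting the pieces together: the writers described above generate one
*doc_id* per received analysis result and one *text_order* per text within
it, so the pair identifies the originating text. Here is a toy sketch in
plain Python (no Anacode dependency; the rows and texts are made up for
illustration) of matching flat rows back to the inputs:

```python
# Hypothetical flat rows, shaped like the writers' sentiment output.
rows = [
    {'doc_id': 0, 'text_order': 0, 'sentiment_value': 0.7},
    {'doc_id': 0, 'text_order': 1, 'sentiment_value': -0.1},
    {'doc_id': 1, 'text_order': 0, 'sentiment_value': 0.34},
]

# The batches of texts, in the order the analyze calls were made.
batches = [
    ['储物空间少', '后备箱空间不小'],
    ['外观很满意'],
]

# doc_id selects the batch, text_order selects the text within it.
for row in rows:
    text = batches[row['doc_id']][row['text_order']]
    print(text, row['sentiment_value'])
```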
Aggregation framework (anacode.agg)
***********************************

Data loading
============

The Anacode Toolkit provides :class:`anacode.agg.aggregation.DatasetLoader`
for loading analyzed data from different sources:

#. From API analysis results

   If you have either a result dictionary from the Anacode API or a list of
   such results, you can load them into memory using
   :func:`DatasetLoader.from_api_result
   <anacode.agg.aggregation.DatasetLoader.from_api_result>`.

   .. code-block:: python

       >>> from anacode.agg import DatasetLoader
       >>> from anacode.api import client
       >>> api = client.AnacodeClient('')
       >>> result1 = api.analyze(['...'], analyses=['concepts'])
       >>> single_result_dataset = DatasetLoader.from_api_result(result1)
       >>> result2 = api.analyze(['...'], analyses=['concepts'])
       >>> whole_result_dataset = DatasetLoader.from_api_result([result1, result2])

#. From a path to a folder with csv files

   If you stored the analysis results in csv files (using
   :class:`anacode.api.writers.CSVWriter`), you can provide the path to their
   parent folder to :func:`DatasetLoader.from_path
   <anacode.agg.aggregation.DatasetLoader.from_path>` to load all available
   results. If you want to load older backed-up csv files, you can use the
   *backup_suffix* argument of the method to specify the suffix of the files
   to load.

#. From an :class:`anacode.api.writers.Writer` instance

   If you used an instance of *Writer* (either *DataFrameWriter* or
   *CSVWriter*) to store the analysis results, you can pass a reference to it
   to the :func:`DatasetLoader.from_writer
   <anacode.agg.aggregation.DatasetLoader.from_writer>` class method.

#. From ``pandas`` dataframes

   You can also use *DatasetLoader*'s :func:`DatasetLoader.__init__
   <anacode.agg.aggregation.DatasetLoader.__init__>`, which simply takes an
   iterable of *pandas.DataFrame* objects with analyzed data.

Accessing analysis data
=======================

There are two ways to access the analysis results from a
:class:`DatasetLoader <anacode.agg.aggregation.DatasetLoader>`. First, you
can access the *pandas.DataFrame* objects directly using
:func:`DatasetLoader.__getitem__
<anacode.agg.aggregation.DatasetLoader.__getitem__>`, as follows:
``absa_texts = dataset['absa_normalized_texts']``. The format of these data
frames is described below.
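Conceptually, the loader behaves like a read-only mapping from dataset names
to flat tables. A toy stand-in in plain Python (not the library's
implementation; the table contents are made up) illustrating this item-access
style:

```python
# Toy stand-in: dataset names map to flat tables (lists of dicts here,
# where the real DatasetLoader holds pandas DataFrames).
class ToyDatasetLoader:
    def __init__(self, frames):
        self._frames = frames

    def __getitem__(self, name):
        # Unknown dataset names raise KeyError, as with a dict.
        return self._frames[name]

dataset = ToyDatasetLoader({
    'absa_normalized_texts': [
        {'doc_id': 0, 'text_order': 0, 'normalized_text': '外观好看,室内舒适。'},
    ],
})
absa_texts = dataset['absa_normalized_texts']
print(absa_texts[0]['normalized_text'])
```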
Second, you can get higher-level access to the separate datasets via
:func:`DatasetLoader.categories
<anacode.agg.aggregation.DatasetLoader.categories>`,
:func:`DatasetLoader.concepts
<anacode.agg.aggregation.DatasetLoader.concepts>`,
:func:`DatasetLoader.sentiments
<anacode.agg.aggregation.DatasetLoader.sentiments>` or
:func:`DatasetLoader.absa <anacode.agg.aggregation.DatasetLoader.absa>`.
These return :class:`anacode.agg.aggregation.ApiCallDataset` instances; the
actions you can perform with them are explained in the next chapter.

Text order field
----------------

All API calls take not a single text for analysis but a list of texts, and
every call returns a list of analyses, one for each text given. The
*text_order* property in a csv row is the index of the analysis in this list
that produced the row. That means that you can use the text_order column to
match analysis results to the specific pieces of text that you sent to the
API for analysis.

.. _analysed-schema:

Table schema
------------

In this section, we describe the table schema of the analysis results for
each of the four calls.

Categories
""""""""""

**categories.csv**

categories.csv will contain one row per supported category name per text. You
can find out more about category classification in the API documentation.

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *category* - category name
- *probability* - float in range [0.0, 1.0]

The probabilities for all categories for a given text sum up to 1.

Concepts
""""""""

**concepts.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *concept* - name of concept
- *freq* - frequency of occurrences of this concept in the text
- *relevance_score* - relative relevance of the concept in this text
- *concept_type* - type of concept (see the API documentation for the list of
  available concept types)

**concept_surface_strings.csv**

concept_surface_strings.csv extends concepts.csv with the surface strings in
the text that realize its concepts.

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *concept* - concept identified by anacode nlp
- *surface_string* - string found in original text that realizes this concept
- *text_span* - string index to original text where you can find this concept

Note that if a concept occurs multiple times in the original text, there will
be multiple rows for it in this file.

Sentiment
"""""""""

**sentiment.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *sentiment_value* - evaluation of document sentiment; values are from
  [-1, 1]

ABSA
""""

**absa_entities.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *entity_name* - name of the entity
- *entity_type* - type of the entity
- *surface_string* - string found in original text that realizes this entity
- *text_span* - string index in original text where surface_string can be
  found

**absa_normalized_text.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *normalized_text* - text with normalized casing and whitespace

**absa_relations.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *relation_id* - since the absa relation output can have multiple relations,
  we introduce relation_id as a foreign key
- *opinion_holder* - optional; if this field is null, the default opinion
  holder is the author
- *restriction* - optional; contextual restriction under which the evaluation
  applies
- *sentiment_value* - polarity of evaluation, has values from [-1, 1]
- *is_external* - whether an external entity was defined for this relation
- *surface_string* - original text that generated this relation
- *text_span* - string index in original text where surface_string can be
  found

**absa_relations_entities.csv**

This table extends absa_relations.csv by providing the list of entities
connected to the evaluations in it.

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *relation_id* - foreign key to absa_relations
- *entity_type* - type of the entity
- *entity_name* - name of the entity

**absa_evaluations.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *evaluation_id* - absa evaluations output can rate multiple entities; this
  serves as a foreign key to them
- *sentiment_value* - numeric value of how positive/negative the statement
  is; from [-1, 1]
- *surface_string* - original text that was used to get this evaluation
- *text_span* - string index in original text where surface_string can be
  found

**absa_evaluations_entities.csv**

- *doc_id* - document id generated incrementally
- *text_order* - index to original input text list
- *evaluation_id* - foreign key to absa_evaluations
- *entity_type* - type of the entity
- *entity_name* - name of the entity

Aggregations
============

The Anacode Toolkit provides a set of common aggregations over the analyzed
data. These are accessible from the four subclasses of :class:`ApiCallDataset
<anacode.agg.aggregation.ApiCallDataset>` - :class:`CategoriesDataset
<anacode.agg.aggregation.CategoriesDataset>`, :class:`ConceptsDataset
<anacode.agg.aggregation.ConceptsDataset>`, :class:`SentimentDataset
<anacode.agg.aggregation.SentimentDataset>` and :class:`ABSADataset
<anacode.agg.aggregation.ABSADataset>`. You can get any of those using the
corresponding properties of the class :class:`DatasetLoader
<anacode.agg.aggregation.DatasetLoader>` (:func:`categories
<anacode.agg.aggregation.DatasetLoader.categories>`, :func:`concepts
<anacode.agg.aggregation.DatasetLoader.concepts>`, :func:`sentiments
<anacode.agg.aggregation.DatasetLoader.sentiments>` and :func:`absa
<anacode.agg.aggregation.DatasetLoader.absa>`).

Here is a list of the aggregations and some other convenience methods of each
api call dataset, with descriptions and usage examples.

ConceptsDataset
---------------

.. _concept_frequency_agg:

- :func:`concept_frequency(concept, concept_type='', normalize=False)
  <anacode.agg.aggregation.ConceptsDataset.concept_frequency>`

  Concepts are returned in the same order as they were given in the input.

  .. code-block:: python

      >>> concept_list = ['CenterConsole', 'MercedesBenz',
      >>>                 'AcceleratorPedal']
      >>> concepts.concept_frequency(concept_list)

  .. parsed-literal::

      Concept
      CenterConsole       27
      MercedesBenz        91
      AcceleratorPedal    39
      Name: Count, dtype: int64

  Limiting concept_type may zero out counts:

  .. code-block:: python

      >>> concepts.concept_frequency(
      >>>     concept_list, concept_type='feature')

  .. parsed-literal::

      Feature
      CenterConsole       27
      MercedesBenz         0
      AcceleratorPedal    39
      Name: Count, dtype: int64

  The next two code samples demonstrate how the percentages can change if the
  concept_type filter changes.

  .. code-block:: python

      >>> concepts.concept_frequency(concept_list, normalize=True)

  .. parsed-literal::

      Concept
      CenterConsole       0.005560
      MercedesBenz        0.018740
      AcceleratorPedal    0.008031
      Name: Count, dtype: float64

  .. code-block:: python

      >>> concepts.concept_frequency(
      >>>     concept_list, concept_type='feature', normalize=True)

  .. parsed-literal::

      Feature
      CenterConsole       0.009174
      MercedesBenz        0.000000
      AcceleratorPedal    0.013252
      Name: Count, dtype: float64

- :func:`most_common_concepts(n=15, concept_type='', normalize=False)
  <anacode.agg.aggregation.ConceptsDataset.most_common_concepts>`

  .. code-block:: python

      >>> concepts.most_common_concepts(n=3)

  .. parsed-literal::

      Concept
      Automobile          533
      BMW                 381
      VisualAppearance    241
      Name: Count, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  concept_type and normalize can change the output.

- :func:`least_common_concepts(n=15, concept_type='', normalize=False)
  <anacode.agg.aggregation.ConceptsDataset.least_common_concepts>`

  .. code-block:: python

      >>> concepts.least_common_concepts(n=3)

  .. parsed-literal::

      Concept
      30       1
      Lepow    1
      Lid      1
      Name: Concept, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  concept_type and normalize can change the output.

- :func:`co_occurring_concepts(concept, n=15, concept_type='')
  <anacode.agg.aggregation.ConceptsDataset.co_occurring_concepts>`

  .. code-block:: python

      >>> concepts.co_occurring_concepts('VisualAppearance', n=5,
      >>>                                concept_type='feature')

  .. parsed-literal::

      Feature
      Interior    33
      Body        26
      Comfort     17
      Space       17
      RearEnd     16
      Name: Count, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  concept_type can change the output.

- :func:`nltk_textcollection(concept_type='')
  <anacode.agg.aggregation.ConceptsDataset.nltk_textcollection>`

  Creates an nltk.text.TextCollection containing the concepts found by the
  linguistic analysis.

- :func:`make_idf_filter(threshold, concept_type='')
  <anacode.agg.aggregation.ConceptsDataset.make_idf_filter>`

  Creates an IDF filter from the concepts found by the linguistic analysis.
  You can read more about IDF filtering in many places, for instance in the
  Stanford Introduction to Information Retrieval.

- :func:`make_time_series(concepts, date_info, delta, interval=None)
  <anacode.agg.aggregation.ConceptsDataset.make_time_series>`

  You have to provide a date_info dictionary to this function. Its keys are
  document indices (integers); its values are :class:`datetime.date` objects:

  .. code-block:: python

      >>> print(date_info)

  .. parsed-literal::

      {0: datetime.date(2016, 1, 1),
       1: datetime.date(2016, 1, 2),
       2: datetime.date(2016, 1, 3),
       3: datetime.date(2016, 1, 4),
       4: datetime.date(2016, 1, 5),
       5: datetime.date(2016, 1, 6),
       ... }

  When you are using scraped data from Anacode in json format, you can build
  the dictionary by looping over the documents that have a date field,
  parsing the date and storing it in the dictionary under the index of the
  document, like this:

  .. code-block:: python

      >>> from datetime import datetime
      >>> date_info = {}
      >>> for index, doc in enumerate(scraped_json_data):
      >>>     if not doc['date']:
      >>>         continue
      >>>     date_info[index] = datetime.strptime(doc['date'], '%Y-%m-%d').date()

  Once you have the date_info dictionary, generating a time series is simple.
  Keep in mind that each resulting time series tick includes its starting
  date and excludes its ending date. So a tick that starts at *Start* and
  ends at *Stop* covers `Start <= concept's document time < Stop`.

  .. code-block:: python

      >>> concepts.make_time_series(['Body'], date_info,
      >>>                           timedelta(days=100))

  .. parsed-literal::

         Count Concept       Start        Stop
      0     89    Body  2016-01-01  2016-04-10
      1     25    Body  2016-04-10  2016-07-19
      2      2    Body  2016-07-19  2016-10-27
      3      3    Body  2016-10-27  2017-02-04

  When you limit the interval (the start and stop of the ticks) and specify a
  delta such that `start + K * delta = stop` cannot be solved, the stop will
  stretch to the first following date for which the formula can be solved.
  For instance, with start 2016-01-01, stop 2016-01-07 and a delta of 4 days,
  stop will be changed to 2016-01-09.

  .. code-block:: python

      >>> concepts.make_time_series(['Body'], date_info,
      >>>                           timedelta(days=4),
      >>>                           (date(2016, 1, 1), date(2016, 1, 7)))

  .. parsed-literal::

         Count Concept       Start        Stop
      0      3    Body  2016-01-01  2016-01-05
      1      2    Body  2016-01-05  2016-01-09

- :func:`concept_cloud(path, size=(600, 350), background='white',
  colormap_name='Accent', max_concepts=200, stopwords=None, concept_type='',
  concept_filter=None, font=None)
  <anacode.agg.aggregation.ConceptsDataset.concept_cloud>`

  This function generates a concept cloud image and either stores it to a
  file or returns it as a numpy ndarray. Here is a simple example of
  generating an ndarray:

  .. code-block:: python

      >>> concept_cloud_img = concepts.concept_cloud(path=None)

CategoriesDataset
-----------------

- :func:`categories() <anacode.agg.aggregation.CategoriesDataset.categories>`

  You can check the list of categories on the `api.anacode.de
  <https://api.anacode.de/>`_ webpage. Each category will be present in the
  output.

  .. code-block:: python

      >>> categories.categories()

  .. parsed-literal::

            Probability
      auto    0.3155102
      hr        0.02371
      ...

- :func:`main_category()
  <anacode.agg.aggregation.CategoriesDataset.main_category>`

  .. code-block:: python

      >>> categories.main_category()

  .. parsed-literal::

      'auto'

SentimentsDataset
-----------------

- :func:`average_sentiment()
  <anacode.agg.aggregation.SentimentDataset.average_sentiment>`

  .. code-block:: python

      >>> sentiments.average_sentiment()

  .. parsed-literal::

      0.43487262467141063

ABSADataset
-----------

- :func:`entity_frequency(entity, entity_type='', normalize=False)
  <anacode.agg.aggregation.ABSADataset.entity_frequency>`

  .. code-block:: python

      >>> absa.entity_frequency(['Oil', 'Buying'])

  .. parsed-literal::

      Entity
      Oil       62
      Buying    80
      Name: Count, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  entity_type and normalize can change the output.

- :func:`most_common_entities(n=15, entity_type='', normalize=False)
  <anacode.agg.aggregation.ABSADataset.most_common_entities>`

  .. code-block:: python

      >>> absa.most_common_entities(n=2)

  .. parsed-literal::

      Entity
      Automobile    538
      BMW           384
      Name: Count, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  entity_type and normalize can change the output.

- :func:`least_common_entities(n=15, entity_type='', normalize=False)
  <anacode.agg.aggregation.ABSADataset.least_common_entities>`

  .. code-block:: python

      >>> absa.least_common_entities(n=2)

  .. parsed-literal::

      Entity
      FashionStyle    1
      Room            1
      Name: entity_name, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  entity_type and normalize can change the output.

- :func:`co_occurring_entities(entity, n=15, entity_type='')
  <anacode.agg.aggregation.ABSADataset.co_occurring_entities>`

  .. code-block:: python

      >>> absa.co_occurring_entities('Oil', n=5,
      >>>                            entity_type='feature_')

  .. parsed-literal::

      Feature
      FuelConsumption    32
      Power              28
      Acceleration       10
      Size                9
      Body                6
      Name: Count, dtype: int64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  entity_type can change the output.

- :func:`best_rated_entities(n=15, entity_type='')
  <anacode.agg.aggregation.ABSADataset.best_rated_entities>`

  .. code-block:: python

      >>> absa.best_rated_entities(n=1)

  .. parsed-literal::

      Entity
      X5    1.0
      Name: Sentiment, dtype: float64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  entity_type can change the output.

- :func:`worst_rated_entities(n=15, entity_type='')
  <anacode.agg.aggregation.ABSADataset.worst_rated_entities>`

  .. code-block:: python

      >>> absa.worst_rated_entities(n=2)

  .. parsed-literal::

      Entity
      Compartment   -1.00
      Black         -0.81
      Name: Sentiment, dtype: float64

  Also read about :ref:`concept_frequency <concept_frequency_agg>` to see how
  entity_type can change the output.

- :func:`surface_strings(entity)
  <anacode.agg.aggregation.ABSADataset.surface_strings>`

  .. code-block:: python

      >>> absa.surface_strings('ShockAbsorption')

  .. parsed-literal::

      {'ShockAbsorption': ['减震效果也非常好',
                           '减震效果和隔音效果也很好',
                           '减震效果也很好']}

- :func:`entity_texts(entity)
  <anacode.agg.aggregation.ABSADataset.entity_texts>`

  .. code-block:: python

      >>> absa.entity_texts(['Room', 'FashionStyle'])

  .. parsed-literal::

      {'FashionStyle': ['外观很满意,外形稍显低调,但不缺乏时尚动感,整车的线条体现更是完整,看起来更为流畅,开眼角大灯我也比较喜欢,这车感觉就像一个穿着休闲西服的长腿欧巴,时而稳重,时而动感'],
       'Room': ['外观好看,室内舒适。']}

- :func:`entity_sentiment(entity)
  <anacode.agg.aggregation.ABSADataset.entity_sentiment>`

  .. code-block:: python

      >>> absa.entity_sentiment({'Oil', 'Seats', 'Room'})

  .. parsed-literal::

      Entity
      Oil      0.750002
      Room     0.201000
      Seats    0.552380
      Name: Sentiment, dtype: float64

Plotting
========

Most of the aggregation results from the previous section can be rendered as
a graph using :mod:`anacode.agg.plotting`. The module knows how to plot three
types of graphs - a :func:`horizontal barchart
<anacode.agg.plotting.barhchart>`, a :func:`piechart
<anacode.agg.plotting.piechart>` and a :func:`word cloud
<anacode.agg.plotting.concept_cloud>`.

Generally, all results that have a meaningful graph representation can be
plotted using :func:`anacode.agg.plotting.barhchart`. The aggregations that
currently have no graph representation are :func:`nltk_textcollection
<anacode.agg.aggregation.ConceptsDataset.nltk_textcollection>`,
:func:`make_idf_filter
<anacode.agg.aggregation.ConceptsDataset.make_idf_filter>`,
:func:`make_time_series
<anacode.agg.aggregation.ConceptsDataset.make_time_series>`,
:func:`main_category
<anacode.agg.aggregation.CategoriesDataset.main_category>`,
:func:`average_sentiment
<anacode.agg.aggregation.SentimentDataset.average_sentiment>`,
:func:`surface_strings
<anacode.agg.aggregation.ABSADataset.surface_strings>` and
:func:`entity_texts <anacode.agg.aggregation.ABSADataset.entity_texts>` - all
other aggregation method results can be plotted as a horizontal bar chart
with :func:`barhchart <anacode.agg.plotting.barhchart>`. Only the
:func:`CategoriesDataset.categories
<anacode.agg.aggregation.CategoriesDataset.categories>` aggregation can be
rendered as a piechart, and :func:`ConceptsDataset.concept_frequencies
<anacode.agg.aggregation.ConceptsDataset.concept_frequencies>` is the only
aggregation method that can be rendered as a concept cloud.

.. code-block:: python

    >>> import matplotlib.pyplot as plt
    >>> from anacode.agg import plotting
    >>> concept_frequencies = concepts.concept_frequencies()
    >>> plotting.concept_cloud(concept_frequencies)
    >>> plt.show()

.. figure:: _static/images/word_cloud.png

.. code-block:: python

    >>> from anacode.agg import plotting
    >>> co_occurring = absa.co_occurring_entities('Seats')
    >>> plotting.barhchart(co_occurring)

.. figure:: _static/images/co_occuring.png
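As a footnote on the tick semantics of ``make_time_series`` described in the
aggregation section: ticks include their start date, exclude their stop date,
and the stop stretches until `start + K * delta` reaches it. A plain-Python
sketch of that tick arithmetic (stdlib only, not the library's code; the
helper name ``make_ticks`` is made up for this illustration):

```python
from datetime import date, timedelta

def make_ticks(start, stop, delta):
    # Stretch stop to the first date of the form start + K * delta that
    # lands on or past it, then return all tick boundaries.
    ticks = [start]
    while ticks[-1] < stop:
        ticks.append(ticks[-1] + delta)
    return ticks

# The example from the aggregation section: with start 2016-01-01,
# stop 2016-01-07 and a 4-day delta, stop stretches to 2016-01-09.
print(make_ticks(date(2016, 1, 1), date(2016, 1, 7), timedelta(days=4)))
# -> [datetime.date(2016, 1, 1), datetime.date(2016, 1, 5), datetime.date(2016, 1, 9)]
```

Each adjacent pair of boundaries is one tick, so a document dated 2016-01-05
falls into the second tick, consistent with `Start <= time < Stop`.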