ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries

Automatic chart-to-text summarization is an effective tool for visually impaired people, while also providing precise insights into tabular data in natural language. A large and well-structured dataset is a key requirement for data-driven models. In this paper, we propose ChartSumm: a large-scale benchmark dataset consisting of a total of 84,363 charts along with their metadata and descriptions, covering a wide range of topics and chart types, for generating short and long summaries. Extensive experiments with strong baseline models show that even though these models generate fluent and informative summaries, achieving decent scores on various automatic evaluation metrics, they often suffer from hallucination, miss important data points, and incorrectly explain complex trends in the charts. We also investigate the potential of expanding ChartSumm to other languages using automated translation tools. These findings make our dataset a challenging benchmark for future research.


Introduction
Automatic chart summarization is a task where the goal is to describe the important data points and trends in a chart in natural language. Chart summaries help readers interpret a chart better, making them useful for visually impaired people as well as for improving the performance of different information retrieval algorithms (Obeid and Hoque, 2020; Carenini et al., 2013; Li et al., 2013).
The scarcity of large-scale, well-defined datasets with chart images, metadata, and well-described summaries is a major challenge in automatic chart summarization. To our best knowledge, there are only four datasets (Obeid and Hoque, 2020; Zhu et al., 2021a; Hsu et al., 2021; Kanthara et al., 2022) available for the chart-to-text summarization task, making this task a low-resource problem. Among them, three (Obeid and Hoque, 2020; Zhu et al., 2021b; Kanthara et al., 2022) contain chart images with metadata and well-defined summaries, while the SciCAP dataset (Hsu et al., 2021) only contains chart images and captions.

Figure 1: An example chart-summary pair from our proposed dataset.
In this work, we address the scarcity of public datasets for the automatic chart summarization task. We propose "ChartSumm", a large-scale dataset for chart-to-text summarization comprising 84,363 chart images with corresponding chart metadata and summaries (see Figure 1 for an example). We also propose two test sets based on the summary length. Our major contributions are summarized below: (i) Proposing a new benchmark dataset for the automatic chart summarization task. To our best knowledge, ChartSumm is currently the largest dataset proposed for this task. We also introduce two different test sets to separately compare performance on generating short and long summaries.
(ii) Conducting a series of experiments using strong baselines to demonstrate that models trained on our dataset have better generalization capability than those trained on other existing datasets. In addition, we identify the limitations of state-of-the-art models on our proposed dataset. Furthermore, we also explore the scope of expanding our dataset to other languages through translation and evaluate the performance on a human-annotated test set in the Bengali language. To our best knowledge, this is the first work that investigates the chart summarization task in any language other than English. The dataset and codes are available at https://github.com/pranonrahman/ChartSumm.

Related Work
Existing chart-to-text summarization systems generate summaries from either the chart image (Hsu et al., 2021) or the chart metadata (Gong et al., 2019; Obeid and Hoque, 2020; Kanthara et al., 2022). Before the advent of deep learning, most early work utilized a two-stage approach that first applied content selection using different statistical tools and then generated summaries using pre-defined templates (Reiter, 2007; Zhu et al., 2021b). However, pre-defined template-based architectures frequently lack generality and fail to capture complex trends in data. In recent years, deep learning-based techniques have gained significant attention (Gong et al., 2019; Obeid and Hoque, 2020; Gajbhiye and Lopes, 2021; Zhu et al., 2021a; Hsu et al., 2020; Zhou et al., 2021; Chai et al., 2021; Kanthara et al., 2022) due to their superior performance over template-based approaches. Nonetheless, due to the lack of chart-to-text summarization datasets, not only do the models proposed for this task require improvements, but their generalized effectiveness also remains to be investigated.

Among the four publicly available benchmark chart-to-text summarization datasets, Chart2Text (Obeid and Hoque, 2020) is the first dataset proposed for this task, including 8,305 samples collected from Statista (https://www.statista.com/). However, this dataset is quite small, so effective data-driven methods cannot be trained on it. Later, the SciCAP (Hsu et al., 2021) dataset was proposed for captioning chart images; since it lacks metadata, it is not suitable for methods that generate summaries from metadata only. The recently proposed AutoChart (Zhu et al., 2021b) dataset is based on pre-determined templates, so its chart descriptions do not contain much variance.
More recently, Kanthara et al. (2022) proposed the Chart-To-Text dataset, which consists of chart images with their corresponding metadata and human-written descriptions. Though this dataset is currently the largest existing dataset for this task, containing 44,085 charts collected from the Statista and Pew websites, our proposed ChartSumm dataset is almost double its size.

The ChartSumm Dataset
This section describes how we compile a large-scale dataset consisting of 84,363 examples from the Knoema and Statista websites for the chart-to-text summarization task, along with an analysis of the dataset.

Dataset Construction
Knoema: It is a statistical service-based online platform that contains economic indicators for more than 200 countries. Knoema provides a short description for each statistic, generated by its digital data assistant Yodatai, that summarizes basic information about the underlying dataset. To construct our dataset, we first crawl over 110,000 statistics from Knoema. Then, we filter out the statistics whose data source is not publicly available, resulting in 43,179 publicly available statistics. Afterward, we collect the chart metadata and the corresponding short descriptive captions. Since the statistics in Knoema are plotted against the year, we classify each chart as a simple line chart. We then tokenize the title and caption of each chart, remove extra white spaces and newlines, apply stemming, and normalize the numerical entities.
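The preprocessing steps can be sketched as follows. Since the paper does not name a specific tokenizer or stemmer, the whitespace tokenizer and the crude suffix-stripping stemmer below are illustrative stand-ins, using only the Python standard library:

```python
import re

def normalize_numbers(text):
    # Strip thousands separators so e.g. "110,000" normalizes to "110000"
    return re.sub(r"(?<=\d),(?=\d)", "", text)

def simple_stem(token):
    # Crude suffix-stripping stemmer (a stand-in for e.g. Porter stemming)
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess_caption(text):
    # Collapse whitespace/newlines, normalize numeric entities,
    # then tokenize and stem
    text = normalize_numbers(" ".join(text.split()))
    return [simple_stem(t) for t in text.lower().split()]
```

For example, `preprocess_caption("Exports  increased to 110,000\nunits")` yields stemmed tokens with the numeric entity normalized.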
Statista: It is also an online platform where statistics on a wide range of topics are published along with a short human-written description of each statistic. Topics in Statista include economics, marketing, industry, and opinion research. For dataset creation, we first crawl over 750,000 available pages in Statista research to collect a list of 41,184 publicly available charts along with their summaries and chart metadata. Then we classify the data into simple and complex charts depending on the number of columns in the chart. As with the Knoema data, we also apply tokenization and stemming. Since many examples from Statista do not contain the x_label, we apply the following heuristic rules to automatically classify the x_labels as Year, Month, Day, Quarter, Country, City, and Area:
• Year: If all x values are integers less than 2050 and greater than 1800, we set the x label to "year".
• Month: If all x values are names of months, we set the x label to "month".
• Day: If all x values are names of days (Saturday, Sunday, ...), we set the x label to "day".
• Quarter: If most x values are Q1/Q2/Q3/Q4, optionally followed by an integer (year), we set the x label to "quarter".
• Country: If more than 30% of the x values appear in the list of all countries collected from Wikipedia, we set the x label to "country".
• City: If more than 30% of the x values appear in the list of cities collected from the World City Database, we set the x label to "city".
• Area: If the x values contain the names of general areas such as continents, sub-continents, etc., we set the x label to "area".
• NER: We also use named entity recognition to identify other entity types such as companies, social media platforms, etc. For charts where the x_label cannot be automatically determined, its value is set to "x_label".
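A minimal sketch of these heuristic rules is shown below. The gazetteers are placeholders standing in for the Wikipedia country list and the World City Database, and the NER and area rules are omitted for brevity:

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"}
# Placeholder gazetteers; the paper draws these from Wikipedia and
# the World City Database, respectively
COUNTRIES = {"germany", "france", "italy", "japan", "brazil"}
CITIES = {"berlin", "paris", "rome", "tokyo"}

def infer_x_label(x_values):
    vals = [str(v).strip().lower() for v in x_values]
    if vals and all(v.isdigit() and 1800 < int(v) < 2050 for v in vals):
        return "year"
    if vals and all(v in MONTHS for v in vals):
        return "month"
    if vals and all(v in DAYS for v in vals):
        return "day"
    # "Most" values look like Q1..Q4, optionally followed by a year
    if sum(bool(re.fullmatch(r"q[1-4](\s*\d{4})?", v)) for v in vals) > 0.5 * len(vals):
        return "quarter"
    if sum(v in COUNTRIES for v in vals) > 0.3 * len(vals):
        return "country"
    if sum(v in CITIES for v in vals) > 0.3 * len(vals):
        return "city"
    return "x_label"  # fallback when no rule fires
```

The rules are applied in order, so an unambiguous match such as a run of four-digit years takes precedence over the fuzzier gazetteer lookups.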
To classify the charts into different types (bar/line/pie), we use ChartReader. The charts are then divided into simple and complex categories.

Dataset Analysis
In this section, we analyze our proposed ChartSumm dataset. First, we compare ChartSumm with some existing datasets in Table 1. We find that both Chart2Text (Obeid and Hoque, 2020) and Chart-To-Text (Kanthara et al., 2022) collected their data from Statista, where the summaries are fairly descriptive and long, whereas our dataset contains both long and short summaries.
Note that our proposed dataset contains line charts, bar charts, and pie charts. For Statista, the bar chart is the most common type (64.70%), followed by the line chart (33.76%) and the pie chart (1.54%). For Knoema, all charts are line charts. In Figure 2, we show the topic distribution of our dataset, obtained via Latent Dirichlet Allocation (LDA) (Blei et al., 2003) topic modeling. In Table 2, we show the average cell counts for the tables, as well as character and token counts for summaries and titles. We find that, in terms of the average number of tokens, the summaries of simple and complex Statista charts are about 35% and 59.9% longer than those of the Knoema charts, respectively.

The data obtained from each source is split into train, validation, and test sets following an 80:10:10 ratio. We show the number of samples in our training, validation, and test sets in Table 4. Since one source of our dataset is Statista, which is also used in the Chart-To-Text (Kantharaj et al., 2022b) dataset, we measure the overlap between the Statista samples in ChartSumm and the Chart-To-Text dataset. For similarity measurement, we first tokenize the captions and then calculate the percentage of matched tokens. We consider two samples to be duplicates when the similarity is greater than 90%. Table 3 shows that only 5,338 captions in our dataset overlap with the Statista samples in Chart-To-Text.
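The token-overlap measure used for this de-duplication can be sketched as follows; the exact tokenizer and the direction of the match are assumptions, since the paper only specifies the percentage of matched tokens and the 90% threshold:

```python
def caption_similarity(caption_a, caption_b):
    """Fraction of tokens in caption_a that also occur in caption_b."""
    tokens_a = caption_a.lower().split()
    tokens_b = set(caption_b.lower().split())
    if not tokens_a:
        return 0.0
    return sum(t in tokens_b for t in tokens_a) / len(tokens_a)

def is_duplicate(caption_a, caption_b, threshold=0.9):
    # Two samples are treated as the same chart when >90% of tokens match
    return caption_similarity(caption_a, caption_b) > threshold
```

For instance, two captions sharing only half of their tokens score 0.5 and are kept as distinct samples.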

Experiments
In this section, we present the baseline models that we utilize to benchmark performance on our proposed dataset, followed by the fine-tuning process, the evaluation metrics, and the experimental results.

Baselines:
We use T5-Base (Raffel et al., 2019) and BART (Lewis et al., 2020) as our baselines due to their effectiveness in chart-to-text tasks (Kantharaj et al., 2022b). T5 is a large pre-trained language model trained on multiple sequence-to-sequence tasks. BART is a sequence-to-sequence model pre-trained for the language modeling task using a denoising autoencoder architecture. We fine-tune three variants of BART: (i) BART-Base, (ii) BART-Large-CNN, and (iii) BART-Large-XSUM. We implement all models using HuggingFace (Wolf et al., 2020). Below we describe our model fine-tuning process.

Figure 3: Fine-tuning process of baseline models

Fine-tuning Process
We used the chart metadata (title, corresponding data table, labels) to fine-tune all four of our pre-trained baseline models. We flattened the table by rows and concatenated it with the caption of the table, separated by a separator token. To mimic the pre-training process of T5, we added the prefix "Summarize chart: " before each example. Figure 3 shows the fine-tuning stages. All four baseline models were fine-tuned for 3 epochs with a batch size of 8 and an initial learning rate of 1e-6. We used AdamW (Kingma and Ba, 2014; Loshchilov and Hutter, 2018) as our optimizer and cross-entropy as the loss function. We used Google Colab for our experiments.
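The input linearization described above can be sketched as follows. The exact separator token and row delimiter are not specified in the paper, so the `<sep>` token, the `|` delimiter, and the inclusion of axis labels here are assumptions:

```python
def linearize_chart(title, table, x_label, y_label, sep=" <sep> "):
    # Flatten the data table row by row: "x y | x y | ..."
    rows = " | ".join(f"{x} {y}" for x, y in table)
    header = f"{x_label} {y_label}"
    # The prefix mimics T5's task-prefix pre-training convention
    return "Summarize chart: " + title + sep + header + " | " + rows
```

A call such as `linearize_chart("Nautilus gross profit", [("2019", "185.3"), ("2020", "228.8")], "year", "million U.S. dollars")` produces a single flat string suitable as encoder input for T5 or BART.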

Evaluation Metrics:
We use five evaluation metrics in our automated evaluation: (i) BLEU (Post, 2018): it uses n-gram overlaps between the reference text and the machine-generated text to determine a similarity score; (ii) BLEURT (Sellam et al., 2020): it evaluates how fluent the candidate is and how well it transfers the reference's meaning (we utilize BLEURT base-128 for our evaluation); (iii) Perplexity (PPL): it quantifies how well a probability model predicts a sample (we utilize pre-trained GPT-2 (Radford et al., 2019) to measure perplexity); (iv) CIDEr (Vedantam et al., 2015): it measures similarity using TF-IDF weighted n-gram overlaps; and (v) Content Selection (CS): it measures how well the generated summary covers the records in the chart data.
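To illustrate the n-gram overlap idea behind BLEU, here is a minimal single-reference, sentence-level sketch with a small smoothing floor; the paper's actual scores would be computed with a standard implementation such as sacreBLEU (Post, 2018):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        # Clipped n-gram matches against the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = len(cand) - n + 1
        if total <= 0:
            return 0.0
        # Tiny floor avoids log(0) when a higher-order n-gram never matches
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An exact copy of the reference scores 1.0, while an unrelated candidate of the same length scores near zero, driven down by the geometric mean of its near-zero n-gram precisions.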

Gold: A survey from April 2020 revealed that during the coronavirus (COVID-19) lockdown about three out of ten Italian adults subscribed to paid digital services, with slightly higher penetration rates among male respondents. 20 percent of male adults subscribed to video on demand platforms, while just 14 percent of female respondents did the same.
BART-Base: This statistic shows the number of paid digital subscriptions during the coronavirus outbreak in Italy in 2020, by user gender.
Comments: Model fails to generate an informative summary.

Gold: This statistic depicts the gross profit of Nautilus Inc. worldwide from 2014 to 2020. In 2020, Nautilus Inc. had a gross profit amounting to just under 230 million U.S. dollars, which is a significant increase compared to 2019.
BART-Large-CNN: The statistic shows the gross profit of Nautilus Inc. worldwide from 2014 to 2020. In 2020, the company's global gross profit amounted to approximately 228.8 million U.S. dollars. The company is headquartered in Seattle, Washington, United States.
Comments: Model hallucinates about the company's headquarters.

We present our experimental results in Table 5. In terms of the BLEURT, CIDEr, and CS metrics, we observe that the BART-Large models (BART-Large-CNN and BART-Large-XSUM) outperform the other models on all three test sets, while T5 performs best in terms of PPL on all test sets. Furthermore, we observe that models fine-tuned on ChartSumm-K always perform better than other models on test sets containing data from Knoema only (except on the PPL metric). We observe similar trends for models fine-tuned on ChartSumm-S. More importantly, all models fine-tuned on our ChartSumm-S dataset outperform the baselines fine-tuned on Chart-To-Text even on the Chart-To-Text test set, showing the effectiveness of our proposed dataset. Meanwhile, models fine-tuned on Chart-To-Text perform very poorly on ChartSumm-K (about 82.48% lower for the best-performing T5 model), indicating that the Chart-To-Text dataset does not generalize to generating summaries that are short and precise. To further investigate the performance, we conduct an error analysis in the following section.

Error Analysis and Challenges:
For error analysis, we randomly sampled 100 instances with their summaries generated by different baseline models.
We notice that in many cases, even though the generated summary is fluent and readable, it contains factually incorrect information and predicts the wrong trend in the data (see the first example in Table 6). We also notice that models sometimes fail to generate informative summaries, conveying nothing about the underlying data (see the second example in Table 6). For both BART-Large models, we also find cases where the models generate information that is entirely irrelevant to the chart (i.e., the hallucination effect (Gong et al., 2019; Obeid and Hoque, 2020; Wiseman et al., 2021)) (see the third example in Table 6).

Evaluation of ChartSumm in Other Languages
To assess the potential for expanding the use of ChartSumm to other languages, we undertook a study where we translated ChartSumm into Bengali and fine-tuned a pre-trained mT5 (Xue et al., 2021) model to evaluate its performance. To our best knowledge, this is the first study to explore the task of summarizing charts in a language other than English.

Translation: To translate the training and validation sets of ChartSumm into Bengali, we utilized NLLB (Costa-jussà et al., 2022), a state-of-the-art neural machine translation model. To ensure proper evaluation, the test data was translated into Bengali with the assistance of human annotators who are undergraduate students proficient in both English and Bengali.
Baseline: We employed mT5 (Xue et al., 2021) as our baseline model due to its efficacy in Bengali text summarization tasks. We fine-tuned a variant of mT5 that was pre-trained on the multilingual XL-Sum dataset (Hasan et al., 2021). The baseline model was fine-tuned for 4 epochs with a batch size of 8 and an initial learning rate of 1e-6. We used the AdamW optimizer and cross-entropy as the loss function.
Evaluation: We conducted automatic evaluation of the text summarization models using BLEU, which measures the similarity between the reference and the machine-generated text via n-gram overlaps. We were unable to employ other evaluation metrics such as BLEURT and CIDEr for Bengali, as they require language-specific models that are not available for that language; therefore, we only used BLEU for our Bengali text summarization experiments. We present the outcomes of our experiments in Table 7. The results show that models fine-tuned on ChartSumm-S and ChartSumm-K performed better on their corresponding test sets than on the combined test set. This indicates that the models can perform well on their respective test sets even with machine-generated translations, which opens up the possibility of investigating ChartSumm in other languages through automatic machine translation. Meanwhile, evaluating performance with human-annotated training data in other languages is also worth investigating in the future.

Conclusion
In this work, we present a new large-scale benchmark dataset for the automatic chart summarization task to address the low-resource problem in this area. Our proposed dataset is almost double the size of the largest existing dataset for this task. Thus, the proposed ChartSumm dataset will serve as a strong benchmark for researchers in this relatively new area of natural language generation. We utilize three BART models and one T5 model as baselines and conduct extensive experiments using various evaluation metrics to identify the challenges in this task. Experimental results show that models fine-tuned on our proposed ChartSumm dataset achieve better domain generalization than those fine-tuned on other existing benchmark datasets. We also explored the possibility of extending ChartSumm to other languages through automatic machine translation. In the future, we would like to extend ChartSumm to a multilingual dataset to address the scarcity of well-formatted datasets in other low-resource languages. We will also study how to incorporate query relevance (Laskar et al., 2022d, 2020a), question answering (Masry et al., 2022; Kantharaj et al., 2022a; Hoque et al., 2022; Laskar et al., 2020c), and entity recognition (Laskar et al., 2022c,b,a) capabilities into this task.