Yale-LILY/dart
hugging_face · Updated 2022-11-18 · Indexed 2024-05-25
Text Generation
Data Conversion
Resource description:
---
annotations_creators:
- crowdsourced
- machine-generated
language_creators:
- crowdsourced
- machine-generated
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|wikitable_questions
- extended|wikisql
- extended|web_nlg
- extended|cleaned_e2e
task_categories:
- tabular-to-text
task_ids:
- rdf-to-text
paperswithcode_id: dart
pretty_name: DART
dataset_info:
  features:
  - name: tripleset
    sequence:
      sequence: string
  - name: subtree_was_extended
    dtype: bool
  - name: annotations
    sequence:
    - name: source
      dtype: string
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 12966443
    num_examples: 30526
  - name: validation
    num_bytes: 1458106
    num_examples: 2768
  - name: test
    num_bytes: 2657644
    num_examples: 5097
  download_size: 29939366
  dataset_size: 17082193
---

# Dataset Card for DART

## Table of Contents

- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** [homepage](https://github.com/Yale-LILY/dart)
- **Repository:** [github](https://github.com/Yale-LILY/dart)
- **Paper:** [paper](https://arxiv.org/abs/2007.02871)
- **Leaderboard:** [leaderboard](https://github.com/Yale-LILY/dart#leaderboard)

### Dataset Summary

DART is a large dataset for open-domain structured data record to text generation. We consider the structured data record input as a set of RDF entity-relation triples, a format widely used for knowledge representation and semantic description. DART consists of 82,191 examples across different domains; each input is a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format, together with its open-domain nature, differentiates DART from other existing table-to-text corpora.

### Supported Tasks and Leaderboards

The task associated with DART is text generation from data records that are RDF triples:

- `rdf-to-text`: The dataset can be used to train a model for text generation from RDF triples, which consists of generating a textual description of structured data. Success on this task is typically measured by achieving *high* [BLEU](https://huggingface.co/metrics/bleu), [METEOR](https://huggingface.co/metrics/meteor), [BLEURT](https://huggingface.co/metrics/bleurt), [MoverScore](https://huggingface.co/metrics/mover_score), and [BERTScore](https://huggingface.co/metrics/bert_score) scores, and a *low* [TER](https://huggingface.co/metrics/ter) score.
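As a rough illustration of this evaluation setup, the sketch below scores a candidate description against a reference using the `evaluate` library (sacreBLEU and METEOR here); both strings are invented for the example:

```python
import evaluate

# Invented model output and gold reference for a single example.
predictions = ["First Clearing is located on NYS 52, 1 mile from Youngsville."]
references = [["First Clearing is based in Callicoon, New York, on NYS 52 1 Mi. Youngsville."]]

sacrebleu = evaluate.load("sacrebleu")  # corpus-level BLEU
meteor = evaluate.load("meteor")

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(meteor.compute(predictions=predictions,
                     references=[refs[0] for refs in references])["meteor"])
```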
The [BART-large](https://huggingface.co/facebook/bart-large) model from [BART](https://huggingface.co/transformers/model_doc/bart.html) currently achieves the following scores:

|       | BLEU  | METEOR | TER  | MoverScore | BERTScore | BLEURT |
| ----- | ----- | ------ | ---- | ---------- | --------- | ------ |
| BART  | 37.06 | 0.36   | 0.57 | 0.44       | 0.92      | 0.22   |

This task has an active leaderboard, which can be found [here](https://github.com/Yale-LILY/dart#leaderboard) and ranks models based on the above metrics.

### Languages

The dataset is in English (en).

## Dataset Structure

### Data Instances

Here is an example from the dataset:

```
{'annotations': {'source': ['WikiTableQuestions_mturk'],
  'text': ['First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville']},
 'subtree_was_extended': False,
 'tripleset': [['First Clearing', 'LOCATION', 'On NYS 52 1 Mi. Youngsville'],
  ['On NYS 52 1 Mi. Youngsville', 'CITY_OR_TOWN', 'Callicoon, New York']]}
```

It contains one annotation, whose textual description is 'First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville'. The RDF triples used to generate this description are in `tripleset`, each formatted as (subject, predicate, object).

### Data Fields

The different fields are:

- `annotations`:
  - `text`: list of text descriptions of the triple sets
  - `source`: list of sources of the RDF triples (WikiTable, e2e, etc.)
- `subtree_was_extended`: boolean indicating whether the subtree considered during dataset construction was extended. This field is sometimes missing, in which case it is set to `None`
- `tripleset`: the RDF triples as a list of (subject, predicate, object) string triples

### Data Splits

There are three splits: train, validation, and test:

|             | train | validation | test |
| ----------- | -----:| ----------:| ----:|
| N. Examples | 30526 |       2768 | 5097 |
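The splits and fields can be inspected directly with the `datasets` library. A minimal sketch, assuming the Hub id `Yale-LILY/dart` (recent versions of `datasets` may also require `trust_remote_code=True` for script-based datasets):

```python
from datasets import load_dataset

dart = load_dataset("Yale-LILY/dart")
print(dart)  # DatasetDict with train/validation/test splits

example = dart["train"][0]
# Each (subject, predicate, object) triple in the input set:
for subj, pred, obj in example["tripleset"]:
    print(f"({subj}, {pred}, {obj})")
# `annotations` holds parallel lists of provenance tags and reference texts:
for src, text in zip(example["annotations"]["source"], example["annotations"]["text"]):
    print(f"[{src}] {text}")
```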
## Dataset Creation

### Curation Rationale

Automatically generating textual descriptions from structured data inputs is crucial to improving the accessibility of knowledge bases to lay users.

### Source Data

DART comes from existing datasets that cover a variety of domains while allowing the construction of a tree ontology and the formation of RDF triple sets as semantic representations. The datasets used are WikiTableQuestions, WikiSQL, WebNLG, and Cleaned E2E.

#### Initial Data Collection and Normalization

DART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions (Pasupat and Liang, 2015) and WikiSQL (Zhong et al., 2017), (2) automatic conversion of questions in WikiSQL to declarative sentences, and (3) incorporation of existing datasets including WebNLG 2017 (Gardent et al., 2017a,b; Shimorina and Gardent, 2018) and Cleaned E2E (Novikova et al., 2017b; Dušek et al., 2018, 2019).

#### Who are the source language producers?

[More Information Needed]

### Annotations

DART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions (Pasupat and Liang, 2015) and WikiSQL (Zhong et al., 2017), (2) automatic conversion of questions in WikiSQL to declarative sentences, and (3) incorporation of existing datasets including WebNLG 2017 (Gardent et al., 2017a,b; Shimorina and Gardent, 2018) and Cleaned E2E (Novikova et al., 2017b; Dušek et al., 2018, 2019).

#### Annotation process

The two-stage annotation process for constructing tripleset-sentence pairs is based on a tree-structured ontology of each table. First, internal skilled annotators denote the parent column for each column header. Then, a larger number of annotators provide a sentential description of an automatically chosen subset of table cells in a row.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Under MIT license (see [here](https://github.com/Yale-LILY/dart/blob/master/LICENSE)).

### Citation Information

```
@article{radev2020dart,
  title={DART: Open-Domain Structured Data Record to Text Generation},
  author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Nazneen Fatema Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}
```

### Contributions

Thanks to [@lhoestq](https://github.com/lhoestq) for adding this dataset.
Original URL:
Provided by:
Yale-LILY

Dataset Overview

Name: DART

Language: English (en)

License: MIT

Multilinguality: monolingual

Size category: 10K<n<100K

Source datasets:

  • Extended from WikiTableQuestions
  • Extended from WikiSQL
  • Extended from WebNLG
  • Extended from Cleaned E2E

Task category: tabular-to-text

Task ID: rdf-to-text

Dataset info:

  • Features:

    • tripleset: sequence of string triples representing the RDF triples.
    • subtree_was_extended: boolean indicating whether the subtree was extended.
    • annotations: sequence containing:
      • source: string, the data source.
      • text: string, the textual description.
  • Data splits (see the inspection sketch after this list):

    • train: 30526 examples, 12966443 bytes.
    • validation: 2768 examples, 1458106 bytes.
    • test: 5097 examples, 2657644 bytes.
  • Download size: 29939366 bytes

  • Dataset size: 17082193 bytes
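The split sizes listed above can be re-derived, along with basic per-example statistics, using the `datasets` library; a minimal sketch, assuming the Hub id `Yale-LILY/dart`:

```python
from datasets import load_dataset

dart = load_dataset("Yale-LILY/dart")

for split, ds in dart.items():
    n_triples = sum(len(ex["tripleset"]) for ex in ds)
    n_refs = sum(len(ex["annotations"]["text"]) for ex in ds)
    # Average input size (triples) and reference count per example.
    print(f"{split}: {len(ds)} examples, "
          f"{n_triples / len(ds):.2f} triples/example, "
          f"{n_refs / len(ds):.2f} references/example")
```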

Dataset Creation

Source data:

  • WikiTableQuestions
  • WikiSQL
  • WebNLG
  • Cleaned E2E

Annotation process:

  • Each table is annotated against a tree-structured ontology.
  • First, skilled internal annotators mark the parent column for each column header.
  • Then, a larger pool of annotators writes sentence descriptions for automatically selected cells in a table row.

License information:

  • MIT license (see https://github.com/Yale-LILY/dart/blob/master/LICENSE)

Citation:

@article{radev2020dart,
  title={DART: Open-Domain Structured Data Record to Text Generation},
  author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Nazneen Fatema Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}

Dataset Introduction
Construction
The DART dataset was built by structuring table data from open-domain knowledge sources: table records are converted into RDF triples and annotated against a tree-structured ontology. Construction combined manual annotation with automatic conversion. Skilled internal annotators first marked the parent column of each column header; a larger pool of annotators then wrote sentence descriptions for automatically selected cells in table rows, yielding tripleset-sentence pairs. The data draws on the existing WikiTableQuestions, WikiSQL, WebNLG, and Cleaned E2E datasets, which were integrated and normalized into the final DART dataset of 82,191 examples.
Characteristics
DART is distinguished by its open-domain nature and its triple-based semantic representation. The dataset covers knowledge from many domains; each input is a semantic RDF triple set derived from table records and a tree-structured ontology, paired with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format sets DART apart from other table-to-text corpora and provides a rich semantic basis for text generation tasks.
Usage
DART ships with train, validation, and test splits for model training and evaluation. Each record provides the text descriptions, data sources, the subtree-extension flag, and the RDF triples; users can build text generation models on these fields and quantify performance with metrics such as BLEU and METEOR. The MIT license grants researchers and developers permissive terms of use. A worked input/output sketch is shown below.
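A minimal seq2seq sketch of the rdf-to-text setup, assuming a T5-style model from `transformers`; the `<H>/<R>/<T>` linearization delimiters and the `t5-small` checkpoint are illustrative choices rather than the paper's exact configuration, and an off-the-shelf checkpoint would need fine-tuning on DART before its outputs are useful:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def linearize(tripleset):
    # Flatten (subject, predicate, object) triples into one input string.
    # This delimiter scheme is illustrative, not DART's official format.
    return " ".join(f"<H> {s} <R> {p} <T> {o}" for s, p, o in tripleset)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

triples = [
    ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
    ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"],
]
inputs = tokenizer(linearize(triples), return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```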
Background and Challenges
Background Overview
DART, a dataset for open-domain structured data record to text generation, was created in 2020 by a research team from the LILY Lab at Yale University. It addresses the task of automatically generating text from structured data records in the open domain; the core research question is how to convert structured data in the form of RDF triples into natural language descriptions. DART was built from existing datasets spanning several domains, including WikiTableQuestions, WikiSQL, WebNLG, and Cleaned E2E, combining manual annotation with automatic conversion, for a total of 82,191 examples. The dataset has been influential in natural language processing, particularly for structured data-to-text generation, and provides a rich experimental resource for related research.
Current Challenges
Several challenges arose in constructing DART. First, the conversion from structured data to text descriptions must convey every fact accurately. Second, data from different sources and formats had to be integrated and normalized. Diversity and coverage also had to be ensured, so that models trained on the dataset can handle a wide range of record formats and contents. At the research level, the central challenges DART poses are improving the naturalness, accuracy, and consistency of generated text, along with effectively evaluating generation quality.
Common Scenarios
Classic Use Cases
DART plays an important role at the intersection of knowledge graphs and natural language processing. By pairing RDF triples with natural language text, it supplies rich training material; its classic use case is training generative models to produce semantic descriptions of structured data.
Practical Applications
In practice, DART can be used to develop intelligent question-answering systems, automated report generation tools, and aids that help data analysts quickly understand data patterns. It supports building friendlier user interfaces that let non-experts interact with complex data structures.
Derived Work
Building on DART, researchers have produced a range of follow-up work, including improved text generation models, new approaches to data-to-text conversion, and evaluations of different models on structured data description tasks. These studies have broadened DART's range of applications and advanced natural language processing.