DFKI-SLT/tacred
hugging_face · Updated 2024-10-16 · Indexed 2024-03-04
Relation Extraction
Knowledge Base Construction
Resource description:
---
annotations_creators:
- crowdsourced
- expert-generated
language:
- en
language_creators:
- found
license:
- other
multilinguality:
- monolingual
pretty_name: The TAC Relation Extraction Dataset, TACRED Revisited and Re-TACRED
size_categories:
- 100K<n<1M
source_datasets:
- extended|other
tags:
- relation extraction
task_categories:
- text-classification
task_ids:
- multi-class-classification
---

# Dataset Card for "TACRED"

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://nlp.stanford.edu/projects/tacred](https://nlp.stanford.edu/projects/tacred)
- **Paper:** [Position-aware Attention and Supervised Data Improve Slot Filling](https://aclanthology.org/D17-1004/)
- **Point of Contact:** See [https://nlp.stanford.edu/projects/tacred/](https://nlp.stanford.edu/projects/tacred/)
- **Size of downloaded dataset files:** 62.3 MB
- **Size of the generated dataset:** 139.2 MB
- **Total amount of disk used:** 201.5 MB

### Dataset Summary

The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation holds. These examples are created by combining available human annotations from the TAC KBP challenges with crowdsourcing. Please see [Stanford's EMNLP paper](https://nlp.stanford.edu/pubs/zhang2017tacred.pdf) or their [EMNLP slides](https://nlp.stanford.edu/projects/tacred/files/position-emnlp2017.pdf) for full details.

To use the dataset reader, you need to obtain the data from the Linguistic Data Consortium: https://catalog.ldc.upenn.edu/LDC2018T24.

Note:

- There is currently a [label-corrected version](https://github.com/DFKI-NLP/tacrev) of the TACRED dataset, which you should consider using instead of the original version released in 2017. For more details on this new version, see the [TACRED Revisited paper](https://aclanthology.org/2020.acl-main.142/) published at ACL 2020.
- There is also a [relabeled and pruned version](https://github.com/gstoica27/Re-TACRED) of the TACRED dataset. For more details on this new version, see the [Re-TACRED paper](https://arxiv.org/abs/2104.08398) published at AAAI 2021.

This repository provides all three versions of the dataset as BuilderConfigs - `'original'`, `'revisited'` and `'re-tacred'`. Simply set the `name` parameter in the `load_dataset` method to choose a specific version. The original TACRED is loaded by default.

### Supported Tasks and Leaderboards

- **Tasks:** Relation Classification
- **Leaderboards:** [https://paperswithcode.com/sota/relation-extraction-on-tacred](https://paperswithcode.com/sota/relation-extraction-on-tacred)

### Languages

The language in the dataset is English.
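Because the underlying text must be obtained from the LDC, the three BuilderConfigs are loaded from a local copy of the release. A minimal sketch of how this could look (the `load_tacred` helper and the `data_dir` path are illustrative, not part of the repository):

```python
def load_tacred(config="original", data_dir="path/to/LDC2018T24/data/json"):
    """Load one of the three TACRED variants from a local LDC copy.

    `config` selects the BuilderConfig; `data_dir` must point at the
    directory holding the LDC release's train/dev/test JSON files.
    """
    if config not in ("original", "revisited", "re-tacred"):
        raise ValueError(f"unknown config {config!r}")
    from datasets import load_dataset  # pip install datasets
    return load_dataset("DFKI-SLT/tacred", name=config, data_dir=data_dir)
```

Passing no `name` gives the original 2017 labels; `"revisited"` and `"re-tacred"` apply the respective corrections to the same underlying LDC text.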
## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 62.3 MB
- **Size of the generated dataset:** 139.2 MB
- **Total amount of disk used:** 201.5 MB

An example of 'train' looks as follows:

```json
{
  "id": "61b3a5c8c9a882dcfcd2",
  "docid": "AFP_ENG_20070218.0019.LDC2009T13",
  "relation": "org:founded_by",
  "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year", "to", "form", "the", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", ",", "crossing", "the", "floor", "with", "17", "members", "of", "parliament", ",", "causing", "constitutional", "monarch", "King", "Letsie", "III", "to", "dissolve", "parliament", "and", "call", "the", "snap", "election", "."],
  "subj_start": 10,
  "subj_end": 13,
  "obj_start": 0,
  "obj_end": 2,
  "subj_type": "ORGANIZATION",
  "obj_type": "PERSON",
  "stanford_pos": ["NNP", "NNP", "VBD", "IN", "NNP", "JJ", "NN", "TO", "VB", "DT", "DT", "NNP", "NNP", "-LRB-", "NNP", "-RRB-", ",", "VBG", "DT", "NN", "IN", "CD", "NNS", "IN", "NN", ",", "VBG", "JJ", "NN", "NNP", "NNP", "NNP", "TO", "VB", "NN", "CC", "VB", "DT", "NN", "NN", "."],
  "stanford_ner": ["PERSON", "PERSON", "O", "O", "DATE", "DATE", "DATE", "O", "O", "O", "O", "O", "O", "O", "ORGANIZATION", "O", "O", "O", "O", "O", "O", "NUMBER", "O", "O", "O", "O", "O", "O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
  "stanford_head": [2, 3, 0, 5, 3, 7, 3, 9, 3, 13, 13, 13, 9, 15, 13, 15, 3, 3, 20, 18, 23, 23, 18, 25, 23, 3, 3, 32, 32, 32, 32, 27, 34, 27, 34, 34, 34, 40, 40, 37, 3],
  "stanford_deprel": ["compound", "nsubj", "ROOT", "case", "nmod", "amod", "nmod:tmod", "mark", "xcomp", "det", "compound", "compound", "dobj", "punct", "appos", "punct", "punct", "xcomp", "det", "dobj", "case", "nummod", "nmod", "case", "nmod", "punct", "xcomp", "amod", "compound", "compound", "compound", "dobj", "mark", "xcomp", "dobj", "cc", "conj", "det", "compound", "dobj", "punct"]
}
```

### Data Fields

The data fields are the same among all splits.
- `id`: the instance id of this sentence, a `string` feature.
- `docid`: the TAC KBP document id of this sentence, a `string` feature.
- `token`: the list of tokens of this sentence, obtained with the StanfordNLP toolkit, a `list` of `string` features.
- `relation`: the relation label of this instance, a `string` classification label.
- `subj_start`: the 0-based index of the start token of the relation subject mention, an `int` feature.
- `subj_end`: the 0-based index of the end token of the relation subject mention, exclusive, an `int` feature.
- `subj_type`: the NER type of the subject mention, among the 23 fine-grained types used in the [Stanford NER system](https://stanfordnlp.github.io/CoreNLP/ner.html), a `string` feature.
- `obj_start`: the 0-based index of the start token of the relation object mention, an `int` feature.
- `obj_end`: the 0-based index of the end token of the relation object mention, exclusive, an `int` feature.
- `obj_type`: the NER type of the object mention, among the 23 fine-grained types used in the [Stanford NER system](https://stanfordnlp.github.io/CoreNLP/ner.html), a `string` feature.
- `stanford_pos`: the part-of-speech tag per token, a `list` of `string` features.
- `stanford_ner`: the NER tags of the tokens (IO scheme), among the 23 fine-grained types used in the [Stanford NER system](https://stanfordnlp.github.io/CoreNLP/ner.html), a `list` of `string` features.
- `stanford_deprel`: the Stanford dependency relation tag per token, a `list` of `string` features.
- `stanford_head`: the head (source) token index (0-based) for the dependency relation per token; the root token has a head index of -1. A `list` of `int` features.
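Because `subj_end` and `obj_end` are exclusive, a mention's surface form can be recovered with a plain list slice. Using an abridged copy of the `'train'` instance shown above (the `mention` helper is illustrative):

```python
# Abridged copy of the 'train' instance shown above; indices up to the
# subject span ("All Basotho Convention", tokens 10-12) are preserved.
example = {
    "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year",
              "to", "form", "the", "All", "Basotho", "Convention"],
    "subj_start": 10, "subj_end": 13,   # end index is exclusive
    "obj_start": 0, "obj_end": 2,
}

def mention(ex, role):
    # A direct slice works because the *_end fields are exclusive.
    return " ".join(ex["token"][ex[f"{role}_start"]:ex[f"{role}_end"]])

subject = mention(example, "subj")  # "All Basotho Convention"
obj = mention(example, "obj")       # "Tom Thabane"
```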
### Data Splits

To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run:

|           | Train                      | Dev                   | Test                  |
| --------- | -------------------------- | --------------------- | --------------------- |
| TACRED    | 68,124 (TAC KBP 2009-2012) | 22,631 (TAC KBP 2013) | 15,509 (TAC KBP 2014) |
| Re-TACRED | 58,465 (TAC KBP 2009-2012) | 19,584 (TAC KBP 2013) | 13,418 (TAC KBP 2014) |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

See the Stanford paper and the TACRED Revisited paper, plus their appendices. To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, all sampled sentences where no relation was found between the mention pairs were fully annotated as negative examples. As a result, 79.5% of the examples are labeled as no_relation.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium ([LDC License](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)). You can download TACRED from the [LDC TACRED webpage](https://catalog.ldc.upenn.edu/LDC2018T24). If you are an LDC member, access is free; otherwise, a $25 access fee applies.
### Citation Information

The original dataset:

```
@inproceedings{zhang2017tacred,
  author    = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
  title     = {Position-aware Attention and Supervised Data Improve Slot Filling},
  url       = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
  pages     = {35--45},
  year      = {2017}
}
```

For the revised version (`"revisited"`), please also cite:

```
@inproceedings{alt-etal-2020-tacred,
  title     = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task",
  author    = "Alt, Christoph and Gabryszak, Aleksandra and Hennig, Leonhard",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month     = jul,
  year      = "2020",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/2020.acl-main.142",
  doi       = "10.18653/v1/2020.acl-main.142",
  pages     = "1558--1569",
}
```

For the relabeled version (`"re-tacred"`), please also cite:

```
@inproceedings{DBLP:conf/aaai/StoicaPP21,
  author    = {George Stoica and Emmanouil Antonios Platanios and Barnab{\'{a}}s P{\'{o}}czos},
  title     = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset},
  booktitle = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI} 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9, 2021},
  pages     = {13843--13850},
  publisher = {{AAAI} Press},
  year      = {2021},
  url       = {https://ojs.aaai.org/index.php/AAAI/article/view/17631},
}
```

### Contributions

Thanks to [@dfki-nlp](https://github.com/dfki-nlp) and [@phucdev](https://github.com/phucdev) for adding this dataset.
Source URL:
Provided by:
DFKI-SLT

Dataset Overview

Dataset Name

  • Name: The TAC Relation Extraction Dataset, TACRED Revisited and Re-TACRED
  • Alias: tacred

Basic Information

  • Language: English
  • License: Other
  • Multilinguality: Monolingual
  • Size category: 100K<n<1M
  • Source datasets: Extended from other datasets
  • Annotation creators: Crowdsourced and expert-generated
  • Task category: Text classification
  • Task ID: Multi-class classification

Detailed Description

  • Summary: TACRED is a large-scale relation extraction dataset with 106,264 examples, built over the newswire and web text used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. It covers 41 relation types; examples holding no defined relation are labeled no_relation.
  • Versions: Three versions are provided: the original, TACRED Revisited, and Re-TACRED.

Dataset Structure

  • Data instances: Each instance contains several fields, such as the instance ID, document ID, relation label, and entity positions.
  • Data fields: Fields such as id, docid, token, and relation describe the structure and content of each sentence.
  • Data splits: The dataset is split by the year of the TAC KBP challenge to reduce bias.
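As a sanity check, the per-split counts given in the card's Data Splits table sum to the advertised 106,264 examples for TACRED (and to 91,467 for the pruned Re-TACRED):

```python
# Split sizes as listed in the Data Splits table of the card.
tacred = {"train": 68_124, "dev": 22_631, "test": 15_509}
re_tacred = {"train": 58_465, "dev": 19_584, "test": 13_418}

tacred_total = sum(tacred.values())      # 106,264 -- matches the summary
re_tacred_total = sum(re_tacred.values())  # 91,467 after pruning
```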

Dataset Creation

  • Annotation process: Human annotations from the TAC KBP challenges were combined with crowdsourced annotations; sentences with no relation were fully annotated as negatives so that trained models are not biased toward predicting false positives.
  • License: The dataset is released through the Linguistic Data Consortium and requires an LDC license.

Considerations for Using the Dataset

  • License: Use of the dataset is subject to the LDC license; access is free for LDC members, while non-members pay a $25 fee.
  • Citation: Cite the original dataset and, where applicable, the revised versions using the corresponding citation formats.

Dataset Versions

  • Original: Released by Zhang et al. in 2017.
  • TACRED Revisited: Released by Alt et al. in 2020, providing label corrections.
  • Re-TACRED: Released by Stoica et al. in 2021, providing a relabeled and pruned version.
Dataset Introduction
Construction
The TACRED dataset was built over newswire and web text, combining the human annotations available from the TAC KBP challenges with crowdsourcing. It contains 106,264 examples covering 41 relation types, with the remaining examples labeled as having no relation. During construction, particular care was taken to fully annotate no-relation samples so that models do not become biased toward false positives on real-world text.
Characteristics
TACRED is distinguished by its large sample size, its diverse set of relation types, and its combination of expert and crowdsourced annotation. The dataset is split in a stratified way to reduce bias, and three versions (original, revisited, and relabeled) are provided to suit different research needs. The dataset is in English and is distributed under an LDC license.
Usage
To use TACRED, obtain the data from the Linguistic Data Consortium. The dataset supports relation classification and can be loaded via Hugging Face's load_dataset method, selecting a specific version ('original', 'revisited', or 're-tacred'). Users must comply with the applicable copyright and license agreements when obtaining and using the data.
Background and Challenges
Background
The TACRED dataset (The TAC Relation Extraction Dataset) is an important dataset for relation extraction in natural language processing. Created in 2017 by a research team at Stanford University, it was designed to provide large-scale annotated data for relation extraction. TACRED is built from the corpora of the yearly TAC KBP challenges and contains 106,264 instances covering 41 relation types. Combining human annotations from the TAC KBP challenges with crowdsourced labels, it has significantly advanced research in the field.
Current Challenges
Research on and use of TACRED faces several challenges: 1) the diversity and complexity of its relation types make relation classification difficult; 2) annotation quality must be assured during construction, reducing the errors crowdsourcing may introduce; 3) the dataset has potential biases and limitations, e.g. 79.5% of its examples are labeled no_relation, which can make models overly conservative when predicting relations on real-world text; 4) the dataset's size and structure can affect training and inference efficiency in some application scenarios.
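Because the no_relation class dominates, TACRED systems are conventionally scored with micro precision/recall/F1 computed over the positive relations only, so a model cannot score well by predicting no_relation everywhere. A minimal sketch of such a scorer (this helper is illustrative, not the official Stanford scorer):

```python
def score(gold, pred, negative_label="no_relation"):
    """Micro P/R/F1 counting only positive (non-no_relation) labels."""
    # A prediction counts as correct only if it matches gold AND is positive.
    correct = sum(g == p != negative_label for g, p in zip(gold, pred))
    pred_pos = sum(p != negative_label for p in pred)
    gold_pos = sum(g != negative_label for g in gold)
    precision = correct / pred_pos if pred_pos else 0.0
    recall = correct / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

Under this metric, predicting no_relation for every instance yields an F1 of zero despite roughly 79.5% accuracy, which is why F1 rather than accuracy is reported on the leaderboard.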
Common Use Cases
Classic Use Cases
In natural language processing, TACRED's classic use case is relation extraction: identifying the semantic relations between entities in text. With its rich set of relation types and its annotation quality, TACRED offers researchers an ideal platform for training and evaluation, enabling models to accurately capture information such as person-organization and geographic relations.
Derived Work
A series of follow-up works build on TACRED, including improvements to the dataset itself such as TACRED Revisited and Re-TACRED, which respectively re-evaluated and relabeled the original data, improving its quality. Many other studies have used TACRED to achieve strong results on relation extraction, advancing the field.
Recent Research
Latest Research Directions
As a key resource for relation extraction, recent research on TACRED focuses on improving models' ability to recognize fine-grained relation types and to correctly identify no-relation instances. Researchers have continued to improve model performance through position-aware attention mechanisms, supervised data augmentation, and related methods. For the revised versions, TACRED Revisited and Re-TACRED, the community is assessing how the different versions affect model performance and how they help promote fairness and reduce bias. This work has advanced relation extraction and provided valuable data resources and evaluation benchmarks for natural language processing.
The above content was compiled and summarized by AI.