DocOIE Dataset
DocOIE is a document-level, context-aware dataset for Open Information Extraction (OpenIE). It consists of an
evaluation dataset and a training dataset. The evaluation dataset contains 800 expert-annotated sentences
sampled from 80 documents in two domains (healthcare and transportation). The training dataset contains 2,400
documents from the same two domains, 1,200 documents in each. All sentences from these documents are used to
bootstrap pseudo labels for neural model training. Note: the DocOIE training dataset includes only document
IDs, which can be used to collect the documents from PatFT. The dataset download and more details are
available on GitHub:
https://github.com/daviddongkc/DocOIE
Please refer to the following paper for a more detailed description of the dataset.
- Kuicai Dong, Yilin Zhao, Aixin Sun, Jung-Jae Kim, Xiaoli Li;
DocOIE: A Document-level Context-Aware Dataset for OpenIE.
ACL'21 Findings [PDF@ACL Anthology].
HSpam14 Dataset
The HSpam14 dataset described in our SIGIR 2015 paper is available through Dropbox or Microsoft OneDrive. The
compressed file HSpam14_dataset.zip is about 74.7 MB, and the uncompressed file HSpam14_dataset.txt is about
308 MB.
The text file contains three columns: tweet_id, label, and step (see sample data on the right). The label
field takes one of three values {0, 1, -1}: 0 for ham (a non-spam tweet), 1 for spam, and -1 for tweets that
could not be labeled as spam or ham even after manual inspection. The step field takes values from 1 to 6,
indicating the step of the labeling process at which the tweet was labeled:
1 => Manual annotation
2 => kNN-based annotation
3 => User-based annotation
4 => Domain-based annotation
5 => Reliable ham tweet detection
6 => EM-based annotation
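The file layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the three columns are separated by whitespace (tab or space) and that the file has no header row; neither detail is stated here, so check the downloaded file first. The names parse_line, label_counts, LABELS, and STEPS are illustrative, and the sample rows are made up.

```python
from collections import Counter

# Label and step codes as described for HSpam14_dataset.txt
LABELS = {0: "ham", 1: "spam", -1: "unlabeled"}
STEPS = {
    1: "Manual annotation",
    2: "kNN-based annotation",
    3: "User-based annotation",
    4: "Domain-based annotation",
    5: "Reliable ham tweet detection",
    6: "EM-based annotation",
}


def parse_line(line):
    """Parse one 'tweet_id label step' record.

    Assumption: fields are whitespace-separated. The tweet ID is kept as a
    string so that large IDs are never altered.
    """
    tweet_id, label, step = line.split()
    label, step = int(label), int(step)
    if label not in LABELS or step not in STEPS:
        raise ValueError(f"unexpected label/step in line: {line!r}")
    return tweet_id, label, step


def label_counts(lines):
    """Count ham / spam / unlabeled records in an iterable of lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            _, label, _ = parse_line(line)
            counts[LABELS[label]] += 1
    return counts


# Made-up rows for illustration only (not real tweet IDs)
sample = ["1001\t0\t1", "1002\t1\t2", "1003\t-1\t1", "1004\t0\t5"]
print(label_counts(sample))
```

To process the real file, pass an open file handle instead of the sample list, for example label_counts(open("HSpam14_dataset.txt")); iterating line by line avoids loading the roughly 308 MB file into memory at once.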
Please refer to the following paper for a more detailed description of the dataset.
- Surendra Sedhai and Aixin Sun.
HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research.
SIGIR '15 [PDF].
Wikipedia Keyphraseness
This dataset contains keyphraseness values for phrases extracted from Wikipedia articles. The keyphraseness
value Q(s) of a phrase s is the probability that the phrase, when it appears in a Wikipedia article, appears
as anchor text. In total, 4,342,732 phrases were extracted from the English Wikipedia dump created on
January 30, 2010. In this release, the 184,979 phrases containing non-English characters have been removed.
Among the remaining 4,157,753 phrases, about 1.9 million have non-zero keyphraseness values. The dataset is
distributed as a zip file (about 45 MB) containing one text file and a readme file. It can be downloaded from
Dropbox or Microsoft OneDrive.
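As a rough illustration of the definition, Q(s) can be estimated as the fraction of articles containing s in which s occurs as anchor text. The sketch below assumes this article-count estimate; the function name and the counts are hypothetical, and the exact computation used for this release is described in the papers listed in this section.

```python
def keyphraseness(anchor_articles, total_articles):
    """Estimate Q(s) for a phrase s.

    anchor_articles: number of articles in which s appears as anchor text
    total_articles:  number of articles in which s appears at all
    Both counts are hypothetical inputs for illustration.
    """
    if anchor_articles < 0 or total_articles < 0:
        raise ValueError("counts must be non-negative")
    if anchor_articles > total_articles:
        raise ValueError("anchor count cannot exceed total count")
    if total_articles == 0:
        return 0.0  # phrase never observed: treat as zero keyphraseness
    return anchor_articles / total_articles


# A phrase seen in 100 articles and linked in 25 of them
print(keyphraseness(25, 100))
```

Under this estimate, a phrase that never occurs as anchor text gets a value of zero, which is consistent with only part of the phrases in the release having non-zero values.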
This dataset has been used in the following three papers. Please refer to them for more details about the
dataset and how the keyphraseness values can be used in various tasks (all papers can be downloaded freely
from the ACM Digital Library using the links below). This dataset is released solely for research purposes.
Please cite at least one of the following three papers if you use this dataset in your research.
- Chenliang Li, Aixin Sun, Jianshu Weng, Qi He.
Exploiting hybrid contexts for Tweet segmentation. SIGIR 2013 [PDF, ACM Link, Bibtex]
- Chenliang Li, Aixin Sun, Anwitaman Datta.
Twevent: segment-based event detection from tweets. CIKM 2012 [PDF, ACM Link, Bibtex]
- Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, Bu-Sung Lee.
TwiNER: named entity recognition in targeted twitter stream. SIGIR 2012 [PDF, ACM Link, Bibtex]
Tag Visual-Representativeness
The dataset used for computing image tag clarity is available online at NUS-Wide. The normalized image tag
clarity scores for the 5,981 most popular tags are available in Excel format. Note that the values reported
here might differ slightly from those reported in the WSM'09 paper because a different number of dummy tags
was used to estimate the expected image clarity scores. The mean and standard deviation of image clarity
scores for a given frequency reported here are estimated from 500 dummy tags (see the MM'10 paper). Please
cite the following papers if you use the above results in your work (e.g., filtering tags by visual concepts,
visual representativeness, or others).
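The normalization mentioned above can be pictured as a z-score: a tag's raw clarity score is compared with the mean and standard deviation of the scores of dummy tags with the same frequency. This is only a sketch under that assumption; the function name and the numbers are hypothetical, and the MM'10 paper gives the actual procedure.

```python
from statistics import mean, stdev


def normalized_clarity(raw_score, dummy_scores):
    """Z-score of a tag's clarity against dummy tags of the same frequency.

    Assumption: normalization is (score - mean) / std. The sample standard
    deviation is used here; the papers may use a different estimator.
    """
    if len(dummy_scores) < 2:
        raise ValueError("need at least two dummy scores")
    mu = mean(dummy_scores)
    sigma = stdev(dummy_scores)
    if sigma == 0:
        raise ValueError("dummy scores have zero variance")
    return (raw_score - mu) / sigma


# Hypothetical clarity scores of dummy tags at one frequency
# (the release uses 500 dummy tags; three are shown for brevity)
dummy = [1.0, 2.0, 3.0]
print(normalized_clarity(4.0, dummy))
```

Because the mean and standard deviation are themselves estimates, changing the number of dummy tags shifts the normalized values slightly, which is the effect the note above describes.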
The tag labels used in the MM'10 paper experiments are available in Excel format. Please drop me an email at
axsun AT ntu DOT edu DOT sg if you have any comments regarding the paper or the experimental results.
- Aixin Sun, Sourav S. Bhowmick.
Quantifying Tag Representativeness of Visual Content of Social Images.
In Proc. of ACM Multimedia (MM'10), Pages 471-480. Firenze, Italy. Oct 2010. [PDF, BibTex]
- Aixin Sun and Sourav S. Bhowmick.
Image Tag Clarity: In Search of Visual-Representative Tags for Social Images.
In Proc. of the 1st ACM SIGMM Workshop on Social Media (WSM09) in conj. with ACM MM, Pages 19-26. Beijing,
China. Oct 2009. [PDF, BibTex]
Comments-Oriented Document Summarization
The blog summarization dataset used in the SIGIR08 paper is available here. Please refer to the following
paper for a detailed description of the dataset, and cite the paper if you use the dataset.
- Meishan Hu, Aixin Sun, and Ee-Peng Lim.
Comments-Oriented Document Summarization: Understanding Documents with Readers' Feedback.
In Proc. of 31st Annual International ACM SIGIR Conference on Research and Development on Information
Retrieval (SIGIR08). Pages 291-298. Singapore. July 2008. [PDF, BibTex]
- Meishan Hu, Aixin Sun, and Ee-Peng Lim.
Comments-Oriented Blog Summarization by Sentence Extraction.
In Proc. of ACM Conference on Information and Knowledge Management (CIKM07). Pages 901-904. Lisboa,
Portugal, Nov. 2007. [PDF, BibTex]
Web Unit Mining (UnitSet)
UnitSet is the dataset used in the Web Unit Mining project. It was created from the WebKB dataset, which is
available at the Web->KB project. Please cite either of the following two papers if you use UnitSet in your
experiments:
- Aixin Sun and Ee-Peng Lim.
Web Unit Based Mining of Homepage Relationships.
Journal of the American Society for Information Science and Technology (JASIST), 57(3):394-407. February
2006. [PDF, BibTex]
- Aixin Sun and Ee-Peng Lim.
Web Unit Mining: Finding and Classifying Subgraphs of Web Pages.
In Proc. of 12th ACM International Conference on Information and Knowledge Management (CIKM 2003),
pp. 108-115, New Orleans, LA, USA, Nov. 2003. [PDF, BibTex]