Introducing the VRDU Dataset: A Breakthrough in Document Understanding
In recent years, there has been significant progress in developing systems that process complex business documents automatically. These systems can turn documents like receipts, insurance quotes, and financial statements into structured objects, which has the potential to revolutionize business workflows by eliminating manual work and reducing errors.
However, the benchmarks used in academic literature do not capture the challenges faced in real-world applications. As a result, models that perform well on academic benchmarks often struggle on complex real-world tasks.
To bridge this gap, we are excited to announce the release of the Visually Rich Document Understanding (VRDU) dataset. This dataset aims to provide researchers with a benchmark that better reflects the complexities of document understanding tasks in real-world scenarios.
The VRDU dataset meets five key requirements for an effective document understanding benchmark:
1. Rich Schema: Real-world documents require structured extraction of different data types, such as numbers, dates, and strings. The benchmark should reflect this complexity, rather than using simple schemas.
2. Layout-Rich Documents: Real-world documents often contain tables, key-value pairs, footnotes, and varying font sizes. The benchmark should include these layout elements to accurately simulate practical challenges.
3. Diverse Templates: A good benchmark should include different structural layouts or templates to measure a model’s ability to generalize to new layouts, rather than memorizing specific structures.
4. High-Quality OCR: The benchmark should use high-quality Optical Character Recognition (OCR) results to focus on the document understanding task, rather than variations in OCR engines.
5. Token-Level Annotation: Ground-truth annotations should map back to the corresponding input text at the token level, so that each token can be labeled as part of its entity. This yields clean training data and avoids incidental matches, where a labeled value merely happens to appear elsewhere in the document.
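To make the fifth requirement concrete, here is a minimal sketch of why value-only ground truth is noisy and how a token-level span removes the ambiguity. The `Annotation` class, the field name, and the toy receipt tokens are all hypothetical and invented for illustration; they do not reflect the actual VRDU file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical token-level label: a field name plus a token span."""
    field: str
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)


def find_value_matches(tokens, value):
    """Return every start index where `value` occurs as a run of tokens."""
    target = value.split()
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]


def span_text(tokens, ann):
    """Recover the annotated text from the token span."""
    return " ".join(tokens[ann.start:ann.end])


# Toy OCR output for a receipt-like document (invented for illustration).
tokens = "Subtotal 50.00 Tax 5.00 Discount 5.00 Total 50.00".split()

# With value-only ground truth ("total" = "50.00"), string matching finds
# two candidates, and one of them is really the subtotal.
matches = find_value_matches(tokens, "50.00")

# A token-level annotation pins the label to one specific occurrence.
total = Annotation(field="total", start=7, end=8)

print("candidate positions:", matches)
print("annotated span:", span_text(tokens, total), "preceded by", tokens[total.start - 1])
```

A training pipeline that only knows the value would have to guess which occurrence to label, or label both; the span-based annotation leaves no such choice.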
The VRDU dataset combines two publicly available datasets, Registration Forms and Ad-Buy Forms, to create a representative sample of real-world use cases. The Ad-Buy Forms contain details of political advertisements, and the Registration Forms are registration forms for foreign agents.
We trained models on different training sets containing 10, 50, 100, and 200 samples and evaluated their performance on three tasks: Single Template Learning, Mixed Template Learning, and Unseen Template Learning. The F1 score was used to measure model performance on the test set.
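As a rough illustration of the metric, the sketch below computes precision, recall, and F1 over extracted entities. The text above only says "F1 score"; the micro-averaging, the exact-match comparison on `(doc_id, field, value)` tuples, and the field names are assumptions made for this example, and the paper's matching rules may differ.

```python
from collections import Counter


def entity_f1(predicted, gold):
    """Micro-averaged precision, recall, and F1 over entity tuples.

    Each entity is a (doc_id, field, value) tuple. A prediction counts as
    correct only on an exact match (an assumption of this sketch).
    """
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: each gold entity can be matched at most once.
    true_pos = sum((pred_counts & gold_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example data: five gold entities across two documents.
gold = [
    ("doc1", "advertiser", "Acme PAC"),
    ("doc1", "gross_amount", "1200.00"),
    ("doc1", "flight_from", "2020-01-01"),
    ("doc2", "advertiser", "Foo Committee"),
    ("doc2", "gross_amount", "800.00"),
]
# Four predictions: three exact matches, one wrong amount, one field missed.
pred = [
    ("doc1", "advertiser", "Acme PAC"),
    ("doc1", "gross_amount", "1200.00"),
    ("doc2", "advertiser", "Foo Committee"),
    ("doc2", "gross_amount", "880.00"),
]

precision, recall, f1 = entity_f1(pred, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Running the same function on models trained with 10, 50, 100, and 200 samples would give one F1 value per training-set size, which is the kind of comparison described above.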
The VRDU dataset provides a more realistic and challenging benchmark for document understanding models. Unlike existing academic benchmarks such as FUNSD, CORD, SROIE, Kleister-NDA, Kleister-Charity, and DeepForm, it meets all five of the requirements listed above.
We believe that the VRDU dataset will help advance research on document understanding tasks, and we are excited to release it to the public under a Creative Commons license. For more details on the dataset and evaluation results, please refer to the paper.
Try the VRDU dataset today and take your document understanding models to the next level!