CatalogBank: a structured and interoperable catalog dataset with a semi-automatic annotation tool (DocumentLabeler) for engineering system design

dc.contributor.author: Bank, Hasan Sinan, author
dc.contributor.author: Herber, Daniel R., author
dc.contributor.author: ACM, publisher
dc.date.accessioned: 2024-12-17T19:15:15Z
dc.date.available: 2024-12-17T19:15:15Z
dc.date.issued: 2024-08-20
dc.description.abstract: In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces CatalogBank, a dataset developed to bridge the gap between textual descriptions and other data modalities related to engineering design catalogs. We utilized existing information extraction methodologies to extract product information from PDF-based catalogs to use in downstream tasks to generate a baseline metric. Our approach not only supports the potential automation of design workflows but also overcomes the limitations of manual data entry and non-standard metadata structures that have historically impeded the seamless integration of textual and other data modalities. Through the use of DocumentLabeler, an open-source annotation tool adapted for our dataset, we demonstrated the potential of CatalogBank in supporting diverse document-based tasks such as layout analysis and knowledge extraction. Our findings suggest that CatalogBank can contribute to document engineering and NLP by providing a robust dataset for training models capable of understanding and processing complex document formats with relatively less effort using the semi-automated annotation tool DocumentLabeler.
dc.format.medium: born digital
dc.format.medium: articles
dc.identifier.bibliographicCitation: Hasan Sinan Bank and Daniel R. Herber. 2024. CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design. In ACM Symposium on Document Engineering 2024 (DocEng ’24), August 20–23, 2024, San Jose, CA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3685650.3685665
dc.identifier.doi: https://doi.org/10.1145/3685650.3685665
dc.identifier.uri: https://hdl.handle.net/10217/239734
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: Publications
dc.relation.ispartof: ACM DL Digital Library
dc.rights: ©Hasan Sinan Bank, et al. ACM 2024. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in DocEng '24, https://dx.doi.org/10.1145/3685650.3685665.
dc.subject: document engineering
dc.subject: annotation
dc.subject: information extraction
dc.subject: document dataset
dc.title: CatalogBank: a structured and interoperable catalog dataset with a semi-automatic annotation tool (DocumentLabeler) for engineering system design
dc.type: Text

Files

Original bundle
Name: FACF_ACMOA_3685650.3685665.pdf
Size: 10.8 MB
Format: Adobe Portable Document Format
