CatalogBank: a structured and interoperable catalog dataset with a semi-automatic annotation tool (DocumentLabeler) for engineering system design
dc.contributor.author | Bank, Hasan Sinan, author | |
dc.contributor.author | Herber, Daniel R., author | |
dc.contributor.author | ACM, publisher | |
dc.date.accessioned | 2024-12-17T19:15:15Z | |
dc.date.available | 2024-12-17T19:15:15Z | |
dc.date.issued | 2024-08-20 | |
dc.description.abstract | In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces CatalogBank, a dataset developed to bridge the gap between textual descriptions and other data modalities related to engineering design catalogs. We utilized existing information extraction methodologies to extract product information from PDF-based catalogs to use in downstream tasks to generate a baseline metric. Our approach not only supports the potential automation of design workflows but also overcomes the limitations of manual data entry and non-standard metadata structures that have historically impeded the seamless integration of textual and other data modalities. Through the use of DocumentLabeler, an open-source annotation tool adapted for our dataset, we demonstrated the potential of CatalogBank in supporting diverse document-based tasks such as layout analysis and knowledge extraction. Our findings suggest that CatalogBank can contribute to document engineering and NLP by providing a robust dataset for training models capable of understanding and processing complex document formats with relatively less effort using the semi-automated annotation tool DocumentLabeler. | |
dc.format.medium | born digital | |
dc.format.medium | articles | |
dc.identifier.bibliographicCitation | Hasan Sinan Bank and Daniel R. Herber. 2024. CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design. In ACM Symposium on Document Engineering 2024 (DocEng ’24), August 20–23, 2024, San Jose, CA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3685650.3685665 | |
dc.identifier.doi | https://doi.org/10.1145/3685650.3685665 | |
dc.identifier.uri | https://hdl.handle.net/10217/239734 | |
dc.language | English | |
dc.language.iso | eng | |
dc.publisher | Colorado State University. Libraries | |
dc.relation.ispartof | Publications | |
dc.relation.ispartof | ACM DL Digital Library | |
dc.rights | ©Hasan Sinan Bank, et al. ACM 2024. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in DocEng '24, https://dx.doi.org/10.1145/3685650.3685665. | |
dc.subject | document engineering | |
dc.subject | annotation | |
dc.subject | information extraction | |
dc.subject | document dataset | |
dc.title | CatalogBank: a structured and interoperable catalog dataset with a semi-automatic annotation tool (DocumentLabeler) for engineering system design | |
dc.type | Text |
Files
Original bundle
- Name: FACF_ACMOA_3685650.3685665.pdf
- Size: 10.8 MB
- Format: Adobe Portable Document Format