
TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables

Mannam, Varun, Wang, Fang, Liu, Chaochun, Chen, Xin

arXiv.org Artificial Intelligence

In talent management systems, critical information often resides in complex tabular formats, presenting significant retrieval challenges for conventional language models. These challenges are particularly pronounced when processing talent documentation that requires precise interpretation of tabular relationships for accurate information retrieval and downstream decision-making. Current table extraction methods struggle with semantic understanding, resulting in poor performance when integrated into retrieval-augmented chat applications. This paper identifies a key bottleneck - while structural table information can be extracted, the semantic relationships between tabular elements are lost, causing downstream query failures. To address this, we introduce TalentMine, a novel LLM-enhanced framework that transforms extracted tables into semantically enriched representations. Unlike conventional approaches relying on CSV or text linearization, our method employs specialized multimodal reasoning to preserve both structural and semantic dimensions of tabular data. Experimental evaluation across employee benefits document collections demonstrates TalentMine's superior performance, achieving 100% accuracy in query answering tasks compared to 0% for standard AWS Textract extraction and 40% for AWS Textract Visual Q&A capabilities. Our comparative analysis also reveals that the Claude v3 Haiku model achieves optimal performance for talent management applications. The key contributions of this work include (1) a systematic analysis of semantic information loss in current table extraction pipelines, (2) a novel LLM-based method for semantically enriched table representation, (3) an efficient end-to-end integration framework for retrieval-augmented systems, and (4) comprehensive benchmarks on talent analytics tasks showing substantial improvements across multiple categories.
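The contrast the abstract draws between CSV linearization and semantically enriched representations can be sketched in a few lines. This is an illustrative toy, not TalentMine's actual method: each cell is rewritten as a sentence that keeps its row key and column header, so a retrieval system can match cell-level questions that a flat CSV dump would miss. All function and field names here are invented for the example.

```python
# Naive baseline vs. semantically enriched table linearization (toy sketch).

def csv_linearize(header, rows):
    """Flat CSV dump: structure survives, but cell semantics are implicit."""
    return "\n".join([",".join(header)] + [",".join(r) for r in rows])

def semantic_linearize(header, rows, key_col=0):
    """Emit one sentence per cell, tying the row key to each column header."""
    sentences = []
    for row in rows:
        key = row[key_col]
        for col, value in zip(header, row):
            if col == header[key_col]:
                continue
            sentences.append(f"The {col} for {key} is {value}.")
    return "\n".join(sentences)

header = ["Plan", "Employee Cost", "Employer Cost"]
rows = [["Basic", "$20", "$80"], ["Premium", "$55", "$145"]]
print(semantic_linearize(header, rows))
```

A query like "What is the employee cost for the Premium plan?" now has a directly matching sentence, whereas the CSV form requires the retriever to reconstruct the row-column relationship itself.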


Abstract2Appendix: Academic Reviews Enhance LLM Long-Context Capabilities

Li, Shengzhi, Kampa, Kittipat, Lin, Rongyu, Li, Bohang, Pei, Shichao

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown remarkable performance across various tasks, yet their ability to handle long-context reading remains challenging. This study explores the effectiveness of leveraging high-quality academic peer review data for fine-tuning LLMs to enhance their long-context capabilities. We compare the Direct Preference Optimization (DPO) method with the Supervised Fine-Tuning (SFT) method, demonstrating DPO's superiority and data efficiency. Our experiments show that the fine-tuned model achieves a 4.04-point improvement over phi-3 and a 2.6% increase on the Qasper benchmark using only 2000 samples. Despite facing limitations in data scale and processing costs, this study underscores the potential of DPO and high-quality data in advancing LLM performance. Additionally, the zero-shot benchmark results indicate that aggregated high-quality human reviews are overwhelmingly preferred over LLM-generated responses, even for the most capable models like GPT-4o. This suggests that high-quality human reviews are extremely rich in information, reasoning, and long-context retrieval, capabilities that even the most advanced models have not fully captured. These findings highlight the high utility of leveraging human reviews to further advance the field.
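For readers unfamiliar with the DPO objective the abstract compares against SFT: given a preferred (chosen) and dispreferred (rejected) response, DPO penalizes the policy when it does not favor the chosen response more strongly than a frozen reference model does. A minimal per-pair sketch of the standard DPO loss (this illustrates the general objective, not this paper's training code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    logp_* are summed token log-probabilities under the policy being tuned;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)), written in a numerically direct form
    return math.log(1.0 + math.exp(-beta * margin))

# The loss shrinks as the policy prefers the chosen answer more than the
# reference does (positive margin), and grows when preferences invert:
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))
```

Because the loss only needs log-probabilities of already-written preference pairs, it is notably data-efficient, which is consistent with the 2000-sample result reported above.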


Making Intelligent Document Processing Smarter: Part 1 - KDnuggets

#artificialintelligence

As seen from the table, these cleaning methods do not work on all the images; in fact, API performance sometimes degrades after applying them. Hence there is a need for a unified solution that can handle all kinds of noise. After testing various datasets, including Noisy Office, Smart Doc QA, SROIE, and custom datasets, to compare and evaluate the performance of Tesseract, Vision, and Textract, we can conclude that OCR output is affected by the noise present in the documents. The built-in denoiser or pre-processor is not sufficient to handle most types of noise, including motion blur and watermarks. If the document images are denoised first, the OCR output can improve significantly.
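A concrete example of the pre-OCR denoising step recommended above: a 3x3 median filter, one of the simplest ways to remove salt-and-pepper noise before handing an image to an OCR engine. Real pipelines would use OpenCV or scikit-image; this pure-Python sketch on a grayscale image (nested lists of 0-255 values) just shows the idea.

```python
from statistics import median

def median_denoise(img):
    """Apply a 3x3 median filter to a grayscale image (list of lists)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # borders are left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out

# Salt noise (a stray white pixel in a dark region) is removed:
noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 255
clean = median_denoise(noisy)
print(clean[2][2])  # 0
```

Median filtering handles impulse noise well but not motion blur or watermarks, which is exactly why the article argues for a unified denoiser rather than a bag of single-purpose cleanups.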


2022H2 Amazon Textract launch summary

#artificialintelligence

Documents are a primary tool for record keeping, communication, collaboration, and transactions across many industries, including financial, medical, legal, and real estate. The millions of mortgage applications and hundreds of millions of W2 tax forms processed each year are just a few examples of such documents. Critical business data remains locked in unstructured documents such as scanned images and PDFs, and extracting it manually, or even with legacy OCR, is tedious, expensive, and error prone. This is why we launched Amazon Textract in 2019 to help you automate these tedious document processing workflows with AI. Amazon Textract automatically extracts printed text, handwriting, and data from any document.
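To make the extraction concrete: Textract's DetectDocumentText API returns a list of "Blocks", where LINE blocks carry the recognized text along with a confidence score. The parser below consumes that response shape; the sample response is a hand-written stand-in, so no AWS call is made here.

```python
def extract_lines(response, min_confidence=80.0):
    """Return recognized text lines above a confidence threshold."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]

# Illustrative response in the DetectDocumentText Blocks shape:
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employee W-2 Form", "Confidence": 99.2},
        {"BlockType": "LINE", "Text": "Wages: 52,000", "Confidence": 97.8},
        {"BlockType": "WORD", "Text": "Wages:", "Confidence": 97.9},
    ]
}
print(extract_lines(sample))  # ['Employee W-2 Form', 'Wages: 52,000']
```

In production the `response` would come from a boto3 Textract client call; filtering on confidence before feeding downstream systems is a common way to keep noisy OCR lines out of automated workflows.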


Build a traceable, custom, multi-format document parsing pipeline with Amazon Textract

#artificialintelligence

Organizational forms serve as a primary business tool across industries--from financial services to healthcare and beyond. Consider, for example, tax filing forms in the tax management industry, where new forms come out each year with largely the same information. AWS customers across sectors need to process and store information in forms as part of their daily business practice. These forms often serve as a primary means for information to flow into an organization where technological means of data capture are impractical. In addition to using forms to capture information, over the years of offering Amazon Textract, we have observed that AWS customers frequently version their organizational forms based on structural changes, fields added or changed, or other considerations such as a change of year or version of the form.


Announcing support for extracting data from identity documents using Amazon Textract

#artificialintelligence

Creating efficiencies in your business is at the top of your list. You want your employees to be more productive, have them focus on high-impact tasks, or find ways to implement better processes to improve outcomes for your customers. There are various ways to solve this problem, and more companies are turning to artificial intelligence (AI) and machine learning (ML) to help. In the financial services sector, new accounts are created online; in healthcare, new digital platforms let patients schedule and manage appointments. Both require users to fill out forms, a process that can be error prone and time consuming, and that can certainly be improved upon.
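The identity-document support announced here is exposed through Textract's AnalyzeID API, which returns an "IdentityDocuments" list whose "IdentityDocumentFields" carry normalized field types such as FIRST_NAME or DATE_OF_BIRTH. A minimal consumer of that response shape (the sample below is illustrative; no AWS call is made):

```python
def id_fields(response):
    """Flatten an AnalyzeID response into one {field_type: value} dict per document."""
    docs = []
    for doc in response.get("IdentityDocuments", []):
        fields = {
            f["Type"]["Text"]: f["ValueDetection"]["Text"]
            for f in doc.get("IdentityDocumentFields", [])
        }
        docs.append(fields)
    return docs

# Illustrative response in the AnalyzeID shape:
sample_id = {
    "IdentityDocuments": [
        {
            "IdentityDocumentFields": [
                {"Type": {"Text": "FIRST_NAME"},
                 "ValueDetection": {"Text": "JANE", "Confidence": 98.6}},
                {"Type": {"Text": "LAST_NAME"},
                 "ValueDetection": {"Text": "DOE", "Confidence": 99.1}},
            ]
        }
    ]
}
print(id_fields(sample_id))  # [{'FIRST_NAME': 'JANE', 'LAST_NAME': 'DOE'}]
```

Because field types are normalized across document layouts, the same downstream code can handle driver's licenses and passports without per-form templates, which is the efficiency gain the article describes.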


A Complete Guide To The Machine Learning Tools On AWS

#artificialintelligence

With a solution to almost every machine learning problem, Amazon Machine Learning offers a rich set of tools for machine learning engineers to work with. Amazon also adds new services every few months based on new use cases, making it one of the most dependable platforms for engineers building AI solutions for their customers.


AWS says Amazon Textract is now HIPAA-eligible

#artificialintelligence

In another big move aimed at its healthcare clients, Amazon Web Services revealed this week that its Textract machine learning technology – which can help healthcare organizations more easily extract data from scanned documents – is now HIPAA eligible, joining a half-dozen other cloud-based AI tools. WHY IT MATTERS It's important to note that HIPAA-eligible is not the same as HIPAA-compliant – it just means that the technology can be configured and put to use in ways that are compliant. Organizations can't simply install Textract and expect to be compliant. That said, with proper configuration, the tool can be deployed at healthcare and life science organizations whose tasks require HIPAA compliance, said Textract Product Lead Kriti Bharti in an Oct. 10 blog post. "Critical healthcare information often lies within documents such as medical records and forms. Healthcare and life science organizations need to access data that is locked inside those documents in order to fulfil medical claims, streamline administrative processes, and process electronic health records," said Bharti.


Tech Diaries: What is all the fuss about Deepfakes? - Medium

#artificialintelligence

The main story of this edition of the Tech Diaries is the deepfakes issue that has the U.S. Congress freaking out. Deepfakes represent a class of synthetic media generated by AI and another dark side of technology -- ringing alarm bells about the implications a sudden digital transformation can have on society as a whole. The disruption caused by deepfakes can have serious consequences for how we differentiate right from wrong -- as if the "fake news" issue on social media and other platforms isn't enough of a headache already. U.S. lawmakers have started hearings on the issue, fearing the disruptive and deceptive technology may unfairly affect the upcoming 2020 U.S. presidential election -- as we saw earlier this year, when even simple low-tech manipulations of videos of the U.S. President and the House Speaker by rival groups created headlines. The real problem starts when advanced deep learning algorithms are employed to create realistic images.


AWS launches Textract, machine learning for text and data extraction

#artificialintelligence

Need to extract content from a document quickly and automatically? Amazon today announced the general availability of Textract, a cloud-hosted and fully managed service that uses machine learning to parse data tables, forms, and whole pages for text and data. The service is available in the US East (N. Virginia), US West (Oregon), and EU (Ireland) regions and will expand to additional regions in the coming year. Textract is more capable than your average optical character recognition system. From files stored in an Amazon S3 bucket, it's able to suss out the contents of fields and tables and the context in which this information is presented, like names and social security numbers in tax forms or totals from photographed receipts.