Extraction of (Key,Value) Pairs from Unstructured Ads
Chakraborty, Sunandan (New York University) | Subramanian, Lakshminarayanan (New York University) | Nyarko, Yaw (New York University)
In this paper, we focus on the problem of extracting structured, labeled data from short, unstructured ad postings from online sources such as Craigslist, where ads are posted on various topics, such as job postings, rentals, and car sales. A fundamental challenge in addressing this problem is that most ad postings are short, highly unstructured texts written in an informal manner, with no inherent grammar or well-defined dictionary. We propose unsupervised and supervised algorithms for extracting structured data from unstructured ads in the form of (key, value) pairs, where the keys naturally represent topic-specific features in the ads. The unsupervised algorithm is centered on building an affinity graph from the words in a topic-specific corpus of such ads, where the edge weights represent affinities between words; the (key, value) extraction algorithm identifies specific groups of words in the affinity graph corresponding to different classes of key attributes. The supervised algorithm uses a Conditional Random Field-based training algorithm to identify specific structured (key, value) pairs based on predefined topic-specific structural data representations of ads. On a corpus of car and apartment ad postings from Craigslist, the unsupervised algorithm achieved accuracies of 67.74% and 68.74% for car and apartment ads, respectively. The supervised algorithm improved on this, with accuracies of 74.07% and 72.59%, respectively.
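To make the affinity-graph idea concrete, the following is a minimal sketch only. The abstract does not say how affinities between words are computed, so this example assumes a simple stand-in: the edge weight between two words is the number of times they occur within a small token window of each other. The function names `build_affinity_graph` and `strongest_neighbors`, the `window` parameter, and the sample ads are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_affinity_graph(ads, window=3):
    """Build a word graph as {word: {neighbor: weight}}.

    Assumption (not from the paper): the weight of an edge is the number
    of times two distinct words appear fewer than `window` positions apart.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for ad in ads:
        tokens = ad.lower().split()
        for i, left in enumerate(tokens):
            # Pair each token with the tokens that follow it inside the window.
            for right in tokens[i + 1 : i + window]:
                if left != right:
                    graph[left][right] += 1
                    graph[right][left] += 1
    return graph


def strongest_neighbors(graph, word, k=3):
    """Return the k neighbors of `word` with the highest edge weight.

    Ties are broken alphabetically so the result is deterministic.
    """
    neighbors = graph.get(word, {})
    return sorted(neighbors, key=lambda w: (-neighbors[w], w))[:k]


# Hypothetical car ads, used only to exercise the sketch.
sample_ads = [
    "honda civic 2008 low miles",
    "honda accord 2010 clean title",
    "toyota camry 2008 low miles",
]
sample_graph = build_affinity_graph(sample_ads)
```

Words that repeatedly appear together (here "low" and "miles") end up joined by heavier edges, which is the kind of signal a grouping step could use to assign words to classes of key attributes. How the paper actually forms those groups is not described in the abstract.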
Nov-1-2014
- Country:
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York (0.14)
- Industry:
- Automobiles & Trucks (1.00)
- Technology: