Customizing the SentenceDetector in Spark NLP
Many Natural Language Processing (NLP) tasks require text to be split into chunks of varying granularity:

1. Document
2. Sentence
3. Token
4. etc.

This post focuses on splitting text into sentences in order to facilitate downstream tasks such as Named Entity Recognition (NER), Text Classification, or Sentiment Analysis. Splitting sentences correctly can be crucial for the success of those tasks, as the following example shows. Suppose a detector (wrongly) splits a German legal reference like "Schütze ZPO 4. Aufl." at the periods inside it: "4." is an ordinal number and "Aufl." abbreviates "Auflage" (edition), so neither period actually ends a sentence. Now you might say this is special-subject material and there will always be exotic cases. But the same issue occurs in everyday text when you want to extract common things. Consider, for example, an invented German address (with correct syntax for the zip code and so forth) that begins with the title "Dr.": that period alone would already trigger a wrong split.
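To make the failure mode concrete, here is a minimal sketch in plain Python (not Spark NLP's actual implementation): a naive splitter that breaks at every period, versus one that recognizes a small, assumed list of German abbreviations and ordinals such as "4.". The continuation of the legal-reference example is invented for illustration. In Spark NLP itself, this is the kind of behavior the SentenceDetector's customization parameters are intended to control.

```python
import re

# Illustrative, deliberately incomplete abbreviation list (an assumption,
# not a list taken from Spark NLP).
ABBREVIATIONS = {"Dr.", "Aufl.", "Nr.", "Abs."}

def naive_split(text):
    # Break after any period that is followed by whitespace.
    return re.split(r"(?<=\.)\s+", text)

def abbreviation_aware_split(text):
    sentences = []
    for part in re.split(r"(?<=\.)\s+", text):
        if sentences:
            prev_last = sentences[-1].split()[-1]
            # Merge if the previous fragment ended in a known abbreviation
            # or in an ordinal such as "4." (digits followed by a period).
            if prev_last in ABBREVIATIONS or re.fullmatch(r"\d+\.", prev_last):
                sentences[-1] += " " + part
                continue
        sentences.append(part)
    return sentences

text = "Schütze ZPO 4. Aufl. enthält die einschlägige Kommentierung."
print(naive_split(text))               # three wrong fragments
print(abbreviation_aware_split(text))  # one correct sentence
```

The naive splitter returns three fragments for a single sentence, while the abbreviation-aware variant stitches them back together. Real sentence detectors generalize this idea with larger abbreviation inventories and configurable boundary rules.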
Aug-23-2021, 22:35:30 GMT