Customizing The SentenceDetector In Spark NLP - AI Summary

Aug-24-2021, 22:58:22 GMT–#artificialintelligence

There are many Natural Language Processing (NLP) tasks that require text to be split into chunks of varying granularity. Extracting the names and addresses of a person is almost impossible under these conditions, simply because the data preparation stage was not up to it. Subject-specific technical terms are sometimes abbreviated in a way that is otherwise not generally used: (German Legal References: "Putzo ZPO 39. So, let's take the German legal reference example from above and apply Spark NLP's extended capabilities in a sample project (with a series of CoLab notebooks) to see how this helps us split text correctly into sentences. And make the first 1000 rulings available as a separate JSON file (since handling larger data collections is otherwise difficult with a normal CoLab license). I developed a command line tool called unsplit that parses the text of the German court rulings and splits sentences at a period, except when the period belongs to one of the known abbreviations in the previously curated list (the unsplit tool is a C#/.NET command line program which I can publish on GitHub if people are interested). But honestly, I use this as a hint towards the quality of a model and tend to say "the truth is in the pudding" and trust a real-world test more than any KPIs. I'll be looking forward to comments about things that could be improved in the data preparation stage of this sentence detection modelling task, or other items you might find worth giving me feedback about.
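
The splitting rule described for the unsplit tool (break at a period unless the period belongs to a known abbreviation) can be sketched in a few lines of Python. This is only an illustration of the idea, not the author's C#/.NET implementation and not Spark NLP's SentenceDetector; the function name `split_sentences` and the abbreviation list are hypothetical stand-ins, since the curated list from the article is not shown here.

```python
import re

# Hypothetical sample abbreviations; the curated list from the article is not shown.
ABBREVIATIONS = {"Abs.", "Art.", "Nr.", "Rn.", "vgl."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at periods, except where the period ends a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundaries: a period followed by whitespace or the end of the text.
    for match in re.finditer(r"\.(?=\s|$)", text):
        end = match.end()
        # The whitespace-delimited token that ends with this period.
        token = text[start:end].split()[-1]
        if token in abbreviations:
            continue  # the period belongs to an abbreviation, so do not split here
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    # Keep any trailing text that does not end with a period.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Das Gericht verweist auf Art. 5 Abs. 1 GG. Die Klage ist zulässig."
    for sentence in split_sentences(sample):
        print(sentence)
```

A plain set lookup keeps the rule easy to audit, but it only catches abbreviations that appear in the list exactly as written, which is why the quality of the curated list matters so much in the data preparation stage.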

abbreviation, data preparation stage, spark nlp, (12 more...)

News Web Page

Country:
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.07)

Industry:
- Law (1.00)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)
