Text Processing

AMD launches its Epyc server chip to take on Intel in the data center


It's not just the folks at AMD who hope that the company's Epyc server processor, officially launched Tuesday, will break Intel's stranglehold on the data-center chip market. Enterprise users, web hosting companies and hyperscale cloud providers all want competition and choice in server chips to curb costs and fuel innovation. At the high end, in approximately the $4,000 range, AMD internal benchmarks show the Epyc 7601 single-socket package offering 75 percent higher floating-point performance (for spreadsheets, graphics and games, for instance) and 47 percent higher integer processing performance (for whole-number and text processing, for example) than Intel's E5-2699A v4. Interestingly, AMD benchmarks show 70 percent higher integer performance over Intel in the mid-range, at the $800 price point, with the Epyc 7301 facing off against the Intel E5-7630.

Machine Learning - the robots are coming, and they come in neat, downloadable, often open-source packages.


Beyond the glamorous and somewhat glitzy background of Machine Learning there are huge new complex sets of machine languages, packages, toolkits, libraries and methods that are making the Machine Learning revolution necessary. OpenCV - The Open Source Computer Vision Library is an open-source computer vision and machine learning software library. MALLET - MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MLlib - MLlib is Apache Spark's scalable machine learning library, usable in Java, Scala, Python, and R. Weka 3: Data Mining Software in Java - Weka is a collection of machine learning algorithms for data mining tasks.

Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization


Computerized cross-language plagiarism detection has recently become essential. With the scarcity of scientific publications in Bahasa Indonesia, many Indonesian authors frequently consult publications in English in order to boost the quantity of scientific publications in Bahasa Indonesia (which is currently rising). Due to the syntax disparity between Bahasa Indonesia and English, most of the existing methods for automated cross-language plagiarism detection do not provide satisfactory results. The results of the experiments showed that the best accuracy achieved is 87% with a document size of 6 words, and the document definition size must be kept below 10 words in order to maintain high accuracy.
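To make the "Learning Vector Quantization" part of the title concrete, here is a minimal sketch of the textbook LVQ1 update rule in Python. It is not the authors' implementation: the two-dimensional feature vectors, the class labels, and the function names `lvq1_step` and `predict` are all made up for illustration, standing in for whatever features the Latent Semantic Analysis stage would actually produce.

```python
from math import dist  # Euclidean distance, Python 3.8+


def lvq1_step(prototypes, x, label, lr=0.1):
    """Apply one LVQ1 update in place and return the winning prototype's index.

    prototypes: list of (vector, class_label) pairs; vector is a list of floats.
    """
    # The "winner" is the prototype closest to the training sample.
    winner = min(range(len(prototypes)), key=lambda i: dist(prototypes[i][0], x))
    vec, proto_label = prototypes[winner]
    # Pull the winner toward the sample if the classes match, push it away otherwise.
    sign = 1.0 if proto_label == label else -1.0
    moved = [w + sign * lr * (xi - w) for w, xi in zip(vec, x)]
    prototypes[winner] = (moved, proto_label)
    return winner


def predict(prototypes, x):
    """Classify x with the label of its nearest prototype."""
    return min(prototypes, key=lambda p: dist(p[0], x))[1]


# Hypothetical toy data: one prototype per class in a 2-D feature space.
protos = [([0.0, 0.0], "original"), ([1.0, 1.0], "plagiarized")]
winner = lvq1_step(protos, [0.8, 0.8], "plagiarized", lr=0.5)
guess = predict(protos, [0.7, 0.9])
```

In a full training loop this step would be repeated over many labeled samples, usually with a learning rate that decays over time.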

Apple Announces Core ML: Machine Learning Capabilities on Apple Devices


At WWDC 2017 Apple announced ways it uses machine learning, and ways for developers to add machine learning to their own applications. Their machine learning API, called Core ML, allows developers to integrate machine learning models into apps running on Apple devices with iOS, macOS, watchOS, and tvOS. Natural language processing API calls include language identification, tokenization, part-of-speech tag extraction, and named entity recognition. It is not possible to import an existing TensorFlow model into Core ML, which would be possible with TensorFlow Lite on Android.

Defence of the Doctoral Dissertation: Machine Learning of Semantics for Text Understanding


We propose two rule-based approaches for mapping text into predicate logic. This led us to develop a grammar induction approach for semantic parsing and ontology learning. The induced context-free grammar parses a sentence of text into a semantic tree, which is a meaning representation, where each node has its own semantic category, e.g. To evaluate the models, we propose a new metric -- the accuracy of the classifier trained on the generated dataset and tested on the original, manually constructed dataset.

Conditional Random Fields (CRF): Short Survey


For example, some Indian researchers used CRF to extract keywords from medical texts; they had good features and a large enough training sample, but the quality they obtained did not exceed 0.4 (F1-measure). On real data they would hardly obtain such quality, while Stanford NER shows quality of no more than 0.81 (F-measure), given that it has perfectly selected training features and was trained on larger corpora (CoNLL, MUC-6, MUC-7 and ACE). Some Spanish and Russian researchers compared HMM and CRF on the NER task for medical texts using the JNLPBA corpus (18546 sentences with 109588 named entities). They obtained interesting results: HMM had higher recall (by 4-7%, depending on the type of entity) while CRF had higher precision (by 4-13%, depending on the type of entity). According to one master's thesis, linear-chain CRF performed very well at extracting time expressions from Russian text.
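The F1-measure quoted above is the harmonic mean of precision and recall, which is why a system with higher recall and one with higher precision can end up with similar scores. A small sketch of the arithmetic follows; the function name `f1_score` and the input values are illustrative assumptions, not figures taken from the studies mentioned.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical systems: one favors recall, the other favors precision.
recall_heavy = f1_score(precision=0.6, recall=0.8)
precision_heavy = f1_score(precision=0.8, recall=0.6)
balanced = f1_score(precision=0.5, recall=0.5)
```

Because the formula is symmetric, swapping precision and recall gives the same F1, so the per-entity precision and recall figures are needed to see where two taggers actually differ.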

Trump's legal team apparently doesn't know how to use spell-check


All across the country, middle school English teachers are crying. After James Comey's jaw-dropping testimony today, President Trump's legal team responded by dropping a few jaws of their own. "I am Marc Kasowtiz, Predisent Trump's personal lawyer," the statement reads. In February, the White House released a list of alleged terrorist attacks and misspelled the word attacker as "attaker" 27 times.

Text Analysis for Social Media Cybersecurity: the AMiCA Project


The text analysis part of the AMiCA project (http://www.amicaproject.be), a cooperation between the University of Antwerp and the University of Ghent, developed methods and software to help moderators detect occurrences of unwanted or dangerous situations in their social networks. More specifically, the project developed prototype systems for the detection of cyberbullying, suicide announcements, and sexually transgressive behavior. In this talk I will focus on the text analysis methods that were used for normalization of social media text, for profiling users, and for detecting dangerous content. I will describe the architectures and results of the three resulting applications.

Data with Relationships Yields Insights Before Analytics


Creating relationships between data enables "intelligent inferencing", in other words, data learning from data. Semantic graphs effectively link all data in a uniform fashion with an emphasis on the edges (the links between nodes), a critical determinant for ascertaining relationships that transcends the capabilities of non-semantic property graphs focused solely on the nodes themselves. The purple boxes show the similarity between products, the grey boxes show the prices for these products. But the reality is that an organization's ability to effectively determine the relationships between its data is essential to monetize data, whether big data or otherwise, structured or unstructured, within or across data sets.
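One way to picture "data learning from data" is rule-based inference over subject-predicate-object triples, where the edges carry the meaning. The sketch below is only an illustration under assumed names: the `type` and `subClassOf` predicates, the product identifiers, and the function `infer_types` are hypothetical and do not come from the article or from any particular graph database.

```python
def infer_types(triples):
    """Forward-chain a single rule until no new triples appear:
    (x, "type", A) and (A, "subClassOf", B)  =>  (x, "type", B).
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "type":
                continue
            for s2, p2, o2 in list(inferred):
                # Follow a subClassOf edge that starts at the instance's class.
                if p2 == "subClassOf" and s2 == o and (s, "type", o2) not in inferred:
                    inferred.add((s, "type", o2))
                    changed = True
    return inferred


# Hypothetical product data expressed as (subject, predicate, object) edges.
triples = {
    ("laptop42", "type", "Laptop"),
    ("Laptop", "subClassOf", "Computer"),
    ("Computer", "subClassOf", "Product"),
}
new_facts = infer_types(triples) - triples
```

Here the two facts in `new_facts` (that `laptop42` is also a `Computer` and a `Product`) were never stated explicitly; they follow from the edges alone.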

Machine Learning - Apple Developer


Core ML lets you integrate a broad variety of machine learning model types into your app. You can run machine learning models on the device so data doesn't need to leave the device to be analyzed. Supported features include face tracking, face detection, landmarks, text detection, rectangle detection, barcode detection, object tracking, and image registration. Use trained machine learning models to deeply understand text using features such as language identification, tokenization, lemmatization, part of speech, and named entity recognition.