Must-Know: How to determine the most useful number of clusters?


Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that it was deemed to deserve its own dedicated post. With supervised learning, the number...
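A common heuristic for choosing the number of clusters is the elbow method: run k-means for a range of k and look for the point where the within-cluster sum of squares stops dropping sharply. Below is a minimal NumPy-only sketch (the data, blob centers, and helper name are all hypothetical, and a basic Lloyd's iteration stands in for a production k-means):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated 2-D blobs, so the "right" k is 3
data = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                  for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])

def kmeans_inertia(X, k, iters=20, seed=0):
    """Basic Lloyd's algorithm; returns the within-cluster sum of squares."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.min(axis=1).sum()

# Inertia for k = 1..6; the "elbow" is where the curve stops dropping sharply
inertias = [kmeans_inertia(data, k) for k in range(1, 7)]
```

With well-separated blobs, the curve falls steeply up to k = 3 and flattens after; in practice, random initialization can land in local optima, so libraries rerun with several seeds.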

Word to Vectors – Natural Language Processing – Towards Data Science


Why is Natural Language Processing difficult? Computers interact with humans in programming languages, which are unambiguous, precise, and often structured. Natural (human) language, however, has a lot of ambiguity: there are multiple words with the same meaning (synonyms), words with multiple meanings (p...

Introduction to Recommender System. Part 1 (Collaborative Filtering, Singular Value Decomposition)


The first one analyzes the nature of each item. For instance, recommending poets to a user by performing Natural Language Processing on the content of each poet. Collaborative Filtering, on the other hand, does not require any information about the items or the users themselves. It recommends items ...
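The SVD half of the title can be sketched in a few lines: factor a user-item ratings matrix, keep only the top latent factors, and read predicted preferences off the low-rank reconstruction. A sketch with NumPy, assuming a fully observed toy matrix (real systems handle missing entries, which plain SVD does not):

```python
import numpy as np

# Hypothetical user-item ratings matrix (rows = users, columns = items);
# users 0-1 prefer items 0-1, users 2-3 prefer items 2-3
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

# Truncated SVD: keep the top-k latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# R_hat[u, i] is the predicted preference of user u for item i
```

The two latent factors recover the block structure, so user 0's predicted score for item 0 stays well above their score for item 2.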

Interpretation Result of K means Algorithm


In clustering, the goal is usually high similarity within each group and low similarity between groups. Translated into statistical terms: high similarity within a group means low variance within the cluster, or within_SS. Now, compute all the variance in the data and call it total_SS. In an optimal clustering, since the clusters are very different from each other, most of the total variance is explained by the variance between the groups (between_SS). And since the variance within each group is very small, within_SS accounts for only a small fraction of the total variance in the data.
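The decomposition described above (total_SS = within_SS + between_SS) is easy to check numerically. A minimal sketch with NumPy, assuming hypothetical 1-D data already assigned to two clusters:

```python
import numpy as np

# Hypothetical 1-D data with an obvious two-cluster structure
data = np.array([1.0, 1.2, 0.8, 9.0, 9.2, 8.8])
labels = np.array([0, 0, 0, 1, 1, 1])

# total_SS: squared deviations of every point from the global mean
total_ss = np.sum((data - data.mean()) ** 2)

# within_SS: squared deviations of each point from its own cluster mean
within_ss = sum(
    np.sum((data[labels == k] - data[labels == k].mean()) ** 2)
    for k in np.unique(labels)
)

# between_SS: the remainder, explained by how far apart the clusters sit
between_ss = total_ss - within_ss
explained = between_ss / total_ss  # close to 1 for a good clustering
```

For this well-separated data, between_SS explains nearly all of total_SS, which is exactly the signature of a good clustering.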

Study: Distinct brain rhythms and regions help us reason about categories

MIT News

We categorize pretty much everything we see, and remarkably, we often achieve that feat whether the items look patently similar -- such as Fuji and McIntosh apples -- or they share a more abstract similarity -- such as a screwdriver and a drill. A new study at MIT's Picower Institute for Learning and Memory explains how. "Categorization is a fundamental cognitive mechanism," says Earl Miller, the Picower Professor in MIT's Picower Institute for Learning and Memory and the Department of Brain and Cognitive Sciences. "It's the way the brain learns to generalize. If your brain didn't have this ability, you'd be overwhelmed by details of the sensory world.

Information retrieval document search using vector space model in R


Note that there are many variations in the way the term frequency (tf) and inverse document frequency (idf) are calculated; in this post we have seen one variation. The images below show other recommended variations of tf and idf, taken from Wikipedia. Mathematically, the closeness between two vectors is measured by the cosine of the angle between them. Along the same lines, to find the documents relevant to a query, we compute the cosine similarity score between each document vector and the query vector.
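The scoring step can be sketched in plain Python. The snippet below (my own illustration, not the post's R code) uses raw term counts instead of tf-idf weights to keep the cosine computation itself in focus; the documents and query are made up:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
query = "cat mat"

# Represent each document and the query as a bag-of-words count vector
vecs = [Counter(d.split()) for d in docs]
qvec = Counter(query.split())

scores = [cosine_similarity(qvec, v) for v in vecs]
best = max(range(len(docs)), key=lambda i: scores[i])  # index of the top document
```

Replacing the raw counts with tf-idf weights (any of the variants shown above) changes only how the vectors are built, not the cosine ranking step.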

An introduction to machine-learned ranking in Apache Solr


This tutorial describes how to implement a modern learning to rank (LTR, also called machine-learned ranking) system in Apache Solr. It's intended for people who have zero Solr experience, but who are comfortable with machine learning and information retrieval concepts. I was one of those people only a couple of months ago, and I found it extremely challenging to get up and running with the Solr materials I found online. This is my attempt at writing the tutorial I wish I had when I was getting started. Firing up a vanilla Solr instance on Linux (Fedora, in my case) is actually pretty straightforward.

Looking beyond accuracy to improve trust in machine learning - codecentric AG Blog


A general Data Science workflow in machine learning consists of the following steps: gather data, clean and prepare data, train models, and choose the best model based on validation and test errors or other performance criteria. Usually we – particularly we Data Scientists or Statisticians who live for numbers, small errors, and high accuracy – tend to stop at this point. Let's say we found a model that predicted 99% of our test cases correctly. In and of itself, that is very good performance, and we tend to happily present this model to colleagues, team leaders, decision makers, or whoever else might be interested in our great model. We assume that our model is trustworthy because we have seen it perform well, but we don't know why it performed well.

by jamesmullenbach


Over the break between semesters, I've spent a lot of time with family playing a popular board game called Codenames. If you haven't played, the gist is that one player from each team, the 'spymaster', tries to get their team members to select their team's assigned words from a group of 25, while avoiding the other team's words and a game-ending 'assassin' word, using one-word clues. It's like the game show Password, except clues can apply to any number of words. It's a fun language-based game and makes for an interesting testbed for simple experiments like the one I'm about to talk about. Naturally, I thought about how a computer might play this game.