Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions. For example, nothing in the prior of a mixture model prefers non-overlapping components, and nothing in a topic model ensures that co-occurring words appear in only a small number of topics.
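The i.i.d. redundancy can be seen in a minimal sketch (the Gaussian-mixture setup below is an illustrative assumption, not a model from the text): when component means are drawn independently from the prior, nothing repels them from one another, so two components can land nearly on top of each other.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10  # number of mixture components (illustrative choice)
# i.i.d. prior draws of the component means: each mean is sampled
# independently, with no preference for spreading components apart.
means = rng.normal(0.0, 1.0, size=K)

# Minimum pairwise distance between component means. Because the draws
# are independent, this gap carries no repulsion and can be arbitrarily
# small -- i.e., components are free to overlap.
pairwise = np.abs(means[:, None] - means[None, :])
min_gap = pairwise[~np.eye(K, dtype=bool)].min()
print(f"min gap between components: {min_gap:.3f}")
```

A prior with repulsion (e.g., a determinantal point process over component locations) would push this minimum gap up; the i.i.d. prior does not.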
Data science is growing exponentially. The basic idea is that every online activity leaves a digital trace, which can be transformed into useful insights. Data scientists have become so important to businesses that Thomas H. Davenport has called the role the "sexiest job of the 21st century". Several studies have pointed to an acute shortage of data scientists, one that is expected to intensify in the future. Working alongside decision scientists, they not only help produce working models but also help businesses make careful, data-driven decisions.
As a relatively new term, "data science" can mean different things to different people, due in part to all the hype surrounding the field. In the same breath, we also hear a lot about "big data" and how it is changing the way companies interact with their customers. This raises the question: how are the two related? Unfortunately, the hype often masks reality and worsens the signal-to-noise ratio in our increasingly data-driven society. Rest assured, there truly is something deep and profound happening around data, a genuine paradigm shift, but the hype isn't helping to clarify data science's exact role in big data.
Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that we felt it deserved its own dedicated post. If you are dealing with many predictor variables, chances are high that there are hidden relationships between some of them, leading to redundancy. Unless you identify and handle this redundancy (by selecting only the non-redundant predictor variables) early in your analysis, it can be a huge drag on the steps that follow. It is also likely that not all predictor variables have a considerable impact on the dependent variable(s).
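One simple first pass at spotting such redundancy is to look at pairwise correlations between predictors. The sketch below uses a synthetic dataset of my own construction (the variables x1, x2, x3 and the 0.95 threshold are illustrative assumptions): x2 is nearly a linear copy of x1, so the pair gets flagged as redundant, while the independent x3 does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=n)  # nearly duplicates x1
x3 = rng.normal(size=n)                          # unrelated predictor
names = ["x1", "x2", "x3"]
X = np.column_stack([x1, x2, x3])

# Absolute correlation matrix between predictors; pairs whose absolute
# correlation exceeds the threshold are candidates to drop before modeling.
corr = np.abs(np.corrcoef(X, rowvar=False))
threshold = 0.95
redundant_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if corr[i, j] > threshold
]
print(redundant_pairs)  # the x1/x2 pair should be flagged
```

Correlation only catches pairwise linear redundancy; for multi-variable redundancy, variance inflation factors or a dimensionality-reduction step are common follow-ups.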
This blog is intended for enterprise data analysts, line-of-business users, and data practitioners who use qualitative and quantitative data in decision-making. Even after a model has been tested for production deployment, problems can arise when the data model fails to reflect the real-world business model. The difference(s) between a statistical model and the real world may stem from several factors. Businesses that rely on yesterday's predictive models are likely to produce inaccurate predictions, and the resulting decisions can expose organizations to risk and loss in the event of a disruptive change. Models need timely updates in order to reflect new disruptive changes.
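One common way to detect that yesterday's model no longer reflects today's data is to compare the distribution the model was trained on with the distribution it currently scores. The sketch below uses the population stability index (PSI), a standard drift metric that is my choice of illustration here, not one named in the text; the synthetic data, bin count, and the 0.25 alert threshold are all illustrative assumptions.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a baseline sample and a
    current sample, using equal-width bins over the baseline's range."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins at a tiny probability to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(7)
train = rng.normal(0.0, 1.0, 5000)    # data the model was built on
stable = rng.normal(0.0, 1.0, 5000)   # same regime: PSI stays low
shifted = rng.normal(1.5, 1.0, 5000)  # disruptive change: PSI spikes

print(f"stable PSI:  {psi(train, stable):.3f}")
print(f"shifted PSI: {psi(train, shifted):.3f}")
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift, i.e., a signal that the model needs one of those timely updates.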