This Web page aims to shed some light on the perennial R-vs.-Python debates in the Data Science community, drawing on my experience as a professional computer scientist and statistician. I have a potential bias -- I've written 4 R-related books, and currently serve as Editor-in-Chief of the R Journal -- but I hope this analysis will be considered fair and helpful. This is subjective, of course, but having written (and taught) in many different programming languages, I really appreciate Python's greatly reduced use of parentheses and braces. Ease of learning is of particular interest to me, as an educator: I've taught a number of subjects -- math, stat, CS and even English As a Second Language -- and have given intense thought to the learning process for many, many years.
Python vs. R is a common debate among data scientists, as both languages are useful for data work and are among the skills most frequently mentioned in job postings for data science positions. Each language has its own advantages and disadvantages for data science work, and the choice should depend on the work you are doing. To help data scientists select the right language, Norm Matloff, a professor of computer science at the University of California, Davis, wrote a GitHub post aiming to shed some light on the debate. Although the judgment is subjective, Python's greatly reduced use of parentheses and braces makes its code sleeker, Matloff wrote in the post. Data scientists working with Python must learn a lot of material to get started, including NumPy, Pandas and matplotlib, whereas matrix types and basic graphics are already built into base R, Matloff wrote.
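The syntax point about parentheses and braces can be illustrated with a small sketch. The function name `count_positive` and the sample data are hypothetical, chosen only for illustration, and the R version in the comment is a rough equivalent rather than code from Matloff's post.

```python
def count_positive(values):
    """Count the strictly positive entries of a sequence."""
    count = 0
    for v in values:      # no parentheses around the loop header
        if v > 0:         # no parentheses around the condition
            count += 1    # indentation, not braces, delimits the block
    return count


# A rough R equivalent of the loop needs both parentheses and braces:
#   for (v in values) { if (v > 0) { count <- count + 1 } }

print(count_positive([3, -1, 4, 0, -5, 9]))
```

Whether the Python form is easier to read is, as the post itself says, a subjective matter.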
On large numerical datasets, my impression is that Python is faster and more flexible. You can choose between long and short floats, trading storage for accuracy. Python also has more libraries for medium-sized data management (PyTables, Dask), which can come in handy. In R, when dealing with large data out of core, my approach has been to rely on a standard DBMS (Postgres; MS SQL Server, with which R has native integration; Redshift). As a bonus, dplyr in R offers an excellent, easy interface to these databases.
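The trade-off between short and long floats mentioned above can be sketched with Python's standard-library `array` module, which stores C single-precision (`"f"`) and double-precision (`"d"`) values. This is a minimal illustration, not a benchmark: the helper `storage_and_error` and the sample values are assumptions made for the example. NumPy's `float32` and `float64` dtypes offer the same choice for array data.

```python
from array import array


def storage_and_error(values, typecode):
    """Pack values into a typed array.

    Returns (bytes used by the elements, largest round-trip error).
    """
    packed = array(typecode, values)
    n_bytes = packed.itemsize * len(packed)
    max_err = max(abs(a - b) for a, b in zip(values, packed))
    return n_bytes, max_err


# Hypothetical sample data: 0.1, 0.2, ..., 100.0
values = [i / 10 for i in range(1, 1001)]

single_bytes, single_err = storage_and_error(values, "f")  # C float
double_bytes, double_err = storage_and_error(values, "d")  # C double

print(f"single precision: {single_bytes} bytes, max error {single_err:.2e}")
print(f"double precision: {double_bytes} bytes, max error {double_err:.2e}")
```

On typical platforms the single-precision array takes half the space, at the cost of a small rounding error in each stored value. Whether that loss is acceptable depends on the analysis.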
If you're new to data science, or your organization is, you'll need to pick a language to analyze your data and a thoughtful way to make that decision. Full disclosure: While I can write Python, my background is mostly in the R community--but I'll try my best to be non-partisan. The good news is that you don't need to sweat the decision too hard: both Python and R have vast software ecosystems and communities, so either language is suitable for almost any data science task. The two most commonly used programming language indexes, TIOBE and IEEE Spectrum, rank the most popular programming languages. They use different criteria for popularity, which explains the differences in the results (TIOBE is entirely based on search engine results; IEEE Spectrum also includes community and social media data sources like Stack Overflow, Reddit, and Twitter).
Hi! I'm Jose Portilla and I'm an instructor on Udemy with over 250,000 students enrolled across various courses on Python for Data Science and Machine Learning, R Programming for Data Science, Python for Big Data, and many more. What should I do to become a data scientist? In this post, I'll try my best to help answer this question and point to resources that can help guide you to an answer. Hopefully this post also serves as something I can quickly link to my students :) I've broken down the steps into some key topics and discussed helpful details for each. "The secret of getting ahead is getting started." If you are interested in becoming a data scientist, the best advice is to begin preparing for your journey now!