If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."
However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …
We all start with either a dataset or a goal in mind. Once we've found, collected, or scraped our data, we pull it up and witness the overwhelming sight of merciless cells of numbers, more numbers, categories, and maybe some words! A naive thought crosses our mind: use our machine learning prowess to deal with this tangled mess... but a quick search reveals the host of tasks we'll need to consider before training a model! Once we overcome the shock of our unruly data, we look for ways to battle our formidable nemesis. We start by trying to get our data into Python. It is relatively simple on paper, but the process can be slightly... involved. Nonetheless, a little effort is all that's needed (lucky us). Without wasting any time, we begin data cleaning to get rid of the bogus and expose the beautiful.
I had one of those daydreams that come to you from out of nowhere. Before my eyes fell the image of an all-star women's sports team. Like Mario Kart except instead of Mario, Princess Peach, and Toad you have Serena Williams, Lisa Leslie, and Katelyn Ohashi all playing on the same platform. I realized that I could make this vision a reality by using data science and machine learning tools to design the best teams and predict what it would look like if they were to play against each other. However, as with all big dreams (and big data), I decided to start with a subset of the women's sports world and work my way up towards acquiring data from other women's sports.
Machine learning is 80% preprocessing and 20% model building. You have probably heard this phrase if you have ever talked to a senior Kaggle data scientist or machine learning engineer, and it holds true. In a real-world data science project, data preprocessing is one of the most important steps and a common factor in a model's success: a model built on correctly preprocessed and engineered features is likely to produce noticeably better results than one trained on poorly prepared data. There are four main steps in preprocessing data.
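The original doesn't enumerate the four steps here, so the list below (missing-value imputation, categorical encoding, feature scaling, train/test split) is one common reading of them, not the author's own list. A compact sketch using only pandas and numpy:

```python
import numpy as np
import pandas as pd

# Toy dataset; column names are illustrative.
df = pd.DataFrame({
    "income": [50_000, 64_000, np.nan, 42_000, 58_000, 71_000],
    "city":   ["NY", "SF", "NY", "LA", "SF", "NY"],
    "label":  [0, 1, 0, 0, 1, 1],
})

# 1. Impute missing numeric values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# 2. One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])

# 3. Standardize the numeric feature (zero mean, unit variance).
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()

# 4. Split into train and test sets (80/20, shuffled with a fixed seed).
shuffled = df.sample(frac=1, random_state=0)
cut = int(len(shuffled) * 0.8)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]

print(train.shape, test.shape)
```

In a real project you would fit the imputation and scaling statistics on the training split only, then apply them to the test split, to avoid leaking test information into preprocessing.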
Elasticsearch is a feature-rich, open-source search engine built on top of Apache Lucene, one of the most important full-text search engines on the market. Elasticsearch is best known for the expansive and versatile REST API experience it provides, including efficient wrappers for full-text search, sorting, and aggregation tasks, making it a lot easier to implement such capabilities in existing backends without the need for complex re-engineering. Ever since its introduction in 2010, Elasticsearch has gained a lot of traction in the software engineering domain, and by 2016 it became the most popular enterprise search engine according to the DBMS knowledge base DB-Engines, surpassing the industry-standard Apache Solr (which is also built on top of Lucene). One of the things that makes Elasticsearch so popular is the ecosystem that has grown around it. Engineers across the world have developed open-source Elasticsearch integrations and extensions, and many of these projects were absorbed by Elastic (the company behind the Elasticsearch project) as part of their stack.
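To make the REST API concrete, here is the shape of a typical search request body combining the three capabilities mentioned above (full-text search, sorting, and aggregation), built as a plain Python dict. The index and field names (`articles`, `body`, `author`, `published_at`) are hypothetical; in practice you would POST this JSON to `/articles/_search` on a running cluster:

```python
import json

# Representative Elasticsearch Query DSL body (illustrative names).
query = {
    "query": {
        # Full-text relevance query against an analyzed text field.
        "match": {"body": "open source search"}
    },
    "sort": [
        {"_score": "desc"},                       # rank by relevance first
        {"published_at": {"order": "desc"}},      # tie-break by recency
    ],
    "aggs": {
        # Bucket the matching documents by author (terms aggregation
        # runs on the non-analyzed .keyword sub-field).
        "by_author": {"terms": {"field": "author.keyword", "size": 10}}
    },
    "size": 5,
}

print(json.dumps(query, indent=2))
```

A single request like this returns both the top hits and the aggregation buckets, which is part of what makes the API convenient to wire into an existing backend.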
Hospital readmission rates for certain conditions such as diabetes are now considered an indicator of hospital quality, and readmissions also drive up the cost of care. We used the medical dataset available on the UCI website to find the best models for predicting the readmission of diabetic patients. The stakeholders in this project are hospital officials, who can use the results to determine which patients are at the highest risk of readmission. This could save hospitals millions of dollars and also improve the quality of health care. The first task is to build a diabetes readmission prediction model.
A notebook containing all the relevant code is available on GitHub. Yes, this is yet another post on the subject of EDA, but this step is the most important part of a Data Science project, because it is how you acquire the knowledge, ideas, and intuitions about your data that you will need to model it later. EDA is the art of making your data speak, and of checking its quality (missing data, wrong types, wrong content …).
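A first EDA pass usually starts with a handful of pandas one-liners covering exactly those quality checks. A minimal sketch, using a small inline dataset in place of your own file:

```python
import io
import pandas as pd

# Small illustrative dataset; in a real project this is your own file.
raw = io.StringIO(
    "id,age,salary,dept\n"
    "1,29,48000,sales\n"
    "2,,52000,eng\n"
    "3,41,,eng\n"
    "4,35,61000,\n"
)
df = pd.read_csv(raw)

# Shape and types: are columns parsed the way you expect?
print(df.shape)
print(df.dtypes)

# Quality checks: missing values and exact duplicate rows.
print(df.isna().sum())
print(df.duplicated().sum())

# Distribution of numeric columns: spot outliers and suspicious values.
print(df.describe())
```

These few calls already answer the questions listed above: how much data is missing, whether types were inferred correctly, and whether any values look implausible.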
Numpy and Pandas are probably the two most widely used core Python libraries for data science (DS) and machine learning (ML) tasks. Needless to say, the speed of evaluating numerical expressions is critically important for these DS/ML tasks, and these two libraries do not disappoint in that regard. Under the hood, they use fast, optimized vectorized operations (as much as possible) to speed up mathematical operations. Plenty of articles have been written about how much faster Numpy is (especially when you can vectorize your calculations) than plain-vanilla Python loops or list-based operations; see, for example, "How Fast Numpy Really Is and Why?" and "Data Science with Python: Turn Your Conditional Loops to Numpy Vectors". It pays to vectorize even conditional loops to speed up the overall data transformation.
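A small illustration of that last point: the same conditional transform written as a Python loop and as a single vectorized `np.where` call. The results are identical; only the evaluation strategy differs:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# Conditional transform as a plain Python loop over every element ...
t0 = time.perf_counter()
loop_result = np.array([v * 2 if v > 0 else 0.0 for v in x])
t_loop = time.perf_counter() - t0

# ... and the same logic as one vectorized np.where call.
t0 = time.perf_counter()
vec_result = np.where(x > 0, x * 2, 0.0)
t_vec = time.perf_counter() - t0

assert np.allclose(loop_result, vec_result)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s")
```

Exact timings depend on the machine, but the vectorized version evaluates the condition and both branches in compiled code over the whole array, so the gap is typically one to two orders of magnitude at this size.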
But wait… What is Tensorflow? Tensorflow is a Deep Learning framework by Google, which released its 2nd version in 2019. It is one of the world's most famous Deep Learning frameworks, widely used by industry specialists and researchers. Tensorflow v1 was difficult to use and understand because it was less Pythonic, but in v2, with Keras now fully integrated as tensorflow.keras, it is easy to use, easy to learn, and simple to understand. Remember, this is not a post on Deep Learning, so I expect you to be aware of Deep Learning terms and the basic ideas behind them.
We all have to deal with data, and we try to learn about and implement machine learning in our projects. But everyone seems to forget one thing... it's far from perfect, and there is so much to go through! Don't worry, we'll discuss every little step, from start to finish. All you'll need are these fundamentals.
Spark is an up-and-coming new big data technology; it's a whole lot faster and easier than existing Hadoop-based solutions. H2O does state-of-the-art Machine Learning algorithms over Big Data – and does them Fast. We are happy to announce that H2O now has a basic integration with Spark – Sparkling Water! This is a "deep" integration – H2O can convert Spark RDDs to H2O DataFrames and vice-versa directly. The conversion is fully in-memory and in-process, and distributed and parallel as well.