Data Quality


Text and Data Quality Mining in CRIS

#artificialintelligence

Different research institutions use research information for different purposes. Data analyses and reports based on current research information systems (CRIS) provide information about research activities and their results. As a rule, management and controlling use the research information from the CRIS for reporting. For example, trend analyses support business strategy decisions, and rapid ad hoc analyses help institutions respond effectively to short-term developments. Ultimately, the analysis results and the resulting interpretations and decisions depend directly on the quality of the data.


Basic Data Cleaning for Machine Learning (That You Must Perform)

#artificialintelligence

Data cleaning is a critically important step in any machine learning project. In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform. Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results. In this tutorial, you will discover basic data cleaning you should always perform on your dataset.
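As a rough illustration of what such basic operations can look like, the sketch below assumes two of them: dropping exact duplicate rows and dropping columns that hold a single value. The snippet above does not name the operations, so this choice, along with the function names and the sample rows, is hypothetical.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        # A row is identified by its (column, value) pairs; values must be hashable.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def constant_columns(rows):
    """Return the columns whose value is identical in every row."""
    if len(rows) < 2:
        # With fewer than two rows every column looks constant, so flag none.
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]


def basic_clean(rows):
    """Drop duplicate rows first, then drop single-value columns."""
    rows = drop_duplicate_rows(rows)
    constant = set(constant_columns(rows))
    return [{k: v for k, v in row.items() if k not in constant} for row in rows]


# Hypothetical sample: the third row duplicates the first, and "source" never varies.
sample_rows = [
    {"source": "A", "height": 1.7, "label": 0},
    {"source": "A", "height": 1.8, "label": 1},
    {"source": "A", "height": 1.7, "label": 0},
]
cleaned = basic_clean(sample_rows)
print(cleaned)
```

Duplicates are removed before the constant-column check so that repeated rows do not influence which columns count as constant.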


Learning from Bad Data via Generation

Neural Information Processing Systems

Bad training data hinders the learning model from understanding the underlying data-generating scheme, which in turn makes it harder to achieve satisfactory performance on unseen test data. We suppose the real data distribution lies in a distribution set supported by the empirical distribution of bad data. A worst-case formulation can be developed over this distribution set, and then be interpreted as a generation task in an adversarial manner. The connections and differences between GANs and our framework have been thoroughly discussed. We further theoretically show the influence of this generation task on learning from bad data and reveal its connection with a data-dependent regularization.
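One generic way to write a worst-case formulation of this kind is sketched below. The notation is an assumption, not taken from the abstract: \hat{P}_n denotes the empirical distribution of the bad data, \mathcal{U}(\hat{P}_n) the distribution set built around it, f_\theta the model, and \ell the loss.

```latex
\min_{\theta} \; \sup_{Q \in \mathcal{U}(\hat{P}_n)} \;
  \mathbb{E}_{(x, y) \sim Q} \bigl[ \ell\bigl(f_\theta(x), y\bigr) \bigr]
```

The inner supremum searches the set for the distribution that is hardest for the current model. This is the part that can be read as an adversarial generation task, while the outer minimization trains the model against it.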


Data Cleansing for Models Trained with SGD

Neural Information Processing Systems

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can identify influential instances without using any domain knowledge. The proposed algorithm automatically cleans the data, which does not require any of the users' knowledge. Hence, even non-experts can improve the models. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning.
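To make the notion of an influential instance concrete, the sketch below uses brute-force leave-one-out retraining with SGD on a toy one-parameter regression. It is not the algorithm proposed in the paper, which the abstract does not spell out. The data, learning rate, and function names are hypothetical.

```python
def train_sgd(data, lr=0.01, epochs=200):
    """Fit y ~ w * x with plain SGD over the samples in a fixed order."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # gradient of the squared error
            w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def leave_one_out_influence(train, valid):
    """Score each training instance by how much removing it lowers validation loss."""
    base_loss = mse(train_sgd(train), valid)
    scores = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]
        scores.append(base_loss - mse(train_sgd(reduced), valid))
    return scores


# Hypothetical data: the true relation is y = 2x; the last training point is corrupted.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, -8.0)]
valid = [(1.5, 3.0), (2.5, 5.0)]

influence = leave_one_out_influence(train, valid)
most_harmful = max(range(len(train)), key=lambda i: influence[i])
print(most_harmful, influence)
```

A positive score means the validation loss drops when the instance is removed, so the instance is a candidate for cleansing. Retraining once per instance is only feasible on toy problems, which is why estimating such scores without retraining is the harder problem.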


Nonconvex Low-Rank Tensor Completion from Noisy Data

Neural Information Processing Systems

We study a completion problem of broad practical interest: the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on "incoherent" and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm, (vanilla) gradient descent following a rough initialization, that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all low-rank tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees. The insights conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.
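The gradient-descent stage can be pictured with a least-squares objective over the observed entries, as sketched below. The objective, its scaling, and the symbols are assumptions, not quoted from the abstract: \Omega is the set of observed entries, \mathcal{P}_\Omega keeps only those entries, T^{\mathrm{obs}} is the noisy observed tensor, u_1, \dots, u_r are the columns of the factor matrix U, and \eta is the step size.

```latex
f(U) = \frac{1}{2} \Bigl\| \mathcal{P}_\Omega \Bigl( \sum_{i=1}^{r} u_i \otimes u_i \otimes u_i - T^{\mathrm{obs}} \Bigr) \Bigr\|_F^2,
\qquad
U^{t+1} = U^{t} - \eta \, \nabla f\bigl(U^{t}\bigr), \quad t = 0, 1, 2, \dots
```

Here U^{0} is the output of the rough initialization stage. Because f is nonconvex in U, where the iterations start matters for where they end up.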


Service Objects Leverages Artificial Intelligence (AI) to Offer the Mo

#artificialintelligence

Service Objects, the leading provider of real-time global contact validation solutions, announced it is delivering enhanced results for contact data validation by coupling artificial intelligence (AI) capabilities with its extensive network of over 300 data sources. This combination makes these services the most complete and accurate contact validation APIs available today. Service Objects' APIs allow customers to validate global contact information within their software platforms. These services can verify a contact's name, global address, phone, email address and device simultaneously against hundreds of authoritative data sources, all in less than a second. Service Objects' services work with a process of adaptive machine learning that continually improves their capabilities, leveraging the results of previous transactions.


Bad Data Equals Bad Predictive Model

#artificialintelligence

Data is key to any data science and machine learning task. Data comes in different flavors such as numerical data, categorical data, text data, image data, sound data, and video data. The predictive power of a model depends on the quality of the data used in building the model. Whatever the source of your data, it's important that you understand how the data was collected. For example, data collected from surveys may contain a lot of missing data and false information.


Artificial intelligence is making artificial intelligence easier to build (ZDNet)

#artificialintelligence

Artificial intelligence and machine learning will automate many business and life tasks, from driving trucks to piloting ships to handling customer calls -- and actually carrying on rudimentary chats with them. What's not discussed often enough, however, is the actual impact on the jobs of AI creators and administrators themselves -- developers, analysts, data administrators and everyone else in the information technology orbit charged with building out these revolutionary systems. In essence, AI will play a role in helping to smooth out the rough spots of AI development. IT and data professionals have much to gain from the AI revolution. I recently had the chance to explore some of the possibilities with leading industry observers, who see the roles of IT managers and professionals being elevated to greater business responsibilities as a result of being relieved of much of the grunt work of AI.


How to use Suggestions in SAS Data Studio

#artificialintelligence

With the release of SAS Viya 3.5, you now have a Suggestions feature in SAS Data Studio. The Suggestions feature uses machine learning models to analyze your data and suggest transforms based on the type of data found in your data set. You can add the suggested transforms to your SAS Data Studio plan, allowing you to address data quality issues in your data set. Before you begin using the Suggestions feature in SAS Data Studio, you must first register the models used by the service. A default Models caslib is provided for your default CAS server during the installation process.


Get started with the Data Asset eXchange

#artificialintelligence

The IBM Data Asset eXchange (DAX) is an online hub for developers and data scientists to find free and open data sets under open data licenses. A particular focus of the exchange is data sets under the Community Data License Agreement (CDLA). For developers, DAX offers a trusted source for open data sets for artificial intelligence (AI). These data sets are ready to use in enterprise AI applications and are supplemented with relevant notebooks and tutorials. Also, DAX offers unique access to various IBM and IBM Research data sets and offers various integrations with IBM Cloud and AI services.