Regression
FETILDA: An Effective Framework For Fin-tuned Embeddings For Long Financial Text Documents
Xia, Bolun "Namir", Rawte, Vipula D., Zaki, Mohammed J., Gupta, Aparna
Unstructured data, especially text, continues to grow rapidly in various domains. In particular, in the financial sphere, there is a wealth of accumulated unstructured financial data, such as the textual disclosure documents that companies submit on a regular basis to regulatory agencies, such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company's performance. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). Whereas there has been great progress in pre-trained language models (LMs) that learn from tremendously large corpora of textual data, they still struggle to produce effective representations for long documents. Our work fills this critical need, namely how to develop better models to extract useful information from long textual documents and learn effective features that can leverage the soft financial and risk information for text regression (prediction) tasks. In this paper, we propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations, followed by self-attention to extract valuable document-level features. We evaluate our model on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies. Overall, our framework outperforms strong baseline methods for textual modeling as well as a baseline regression model using only numerical data. Our work provides better insights into how utilizing pre-trained domain-specific and fine-tuned long-input LMs in representing long documents can improve the quality of representation of textual data, and therefore, help in improving predictive analyses.
Machine Learning-Driven Process of Alumina Ceramics Laser Machining
Behbahani, Razyeh, Sarvestani, Hamidreza Yazdani, Fatehi, Erfan, Kiyani, Elham, Ashrafi, Behnam, Karttunen, Mikko, Rahmat, Meysam
Laser machining is a highly flexible non-contact manufacturing technique that has been employed widely across academia and industry. Due to nonlinear interactions between light and matter, simulation methods are extremely crucial, as they help enhance the machining quality by offering comprehension of the inter-relationships between the laser processing parameters. On the other hand, experimental processing parameter optimization recommends a systematic, and consequently time-consuming, investigation over the available processing parameter space. An intelligent strategy is to employ machine learning (ML) techniques to capture the relationship between picosecond laser machining parameters for finding proper parameter combinations to create the desired cuts on industrial-grade alumina ceramic with deep, smooth and defect-free patterns. Laser parameters such as beam amplitude and frequency, scanner passing speed and the number of passes over the surface, as well as the vertical distance of the scanner from the sample surface, are used for predicting the depth, top width, and bottom width of the engraved channels using ML models. Owing to the complex correlation between laser parameters, it is shown that Neural Networks (NN) are the most efficient in predicting the outputs. Equipped with an ML model that captures the interconnection between laser parameters and the engraved channel dimensions, one can predict the required input parameters to achieve a target channel geometry. This strategy significantly reduces the cost and effort of experimental laser machining during the development phase, without compromising accuracy or performance. The developed techniques can be applied to a wide range of ceramic laser machining processes.
Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models
Piovezan, Raphael P. B., Junior, Pedro Paulo de Andrade
This article aims to propose and apply a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of their components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from Brazilian and American markets, in addition to algorithmic error metrics. The results were analyzed and compared to those of the naïve forecast and the returns obtained by the buy & hold technique in the same period of time. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors, where in certain datasets the returns exceeded those of the buy & hold control model by two times and the Sharpe ratio exceeded them by up to four times.
Understanding Linear Regression
Linear regression is a regression model: it outputs a numeric value. It predicts an outcome as a linear combination of the input features, so with a single feature the model is a straight line in the coordinate system. The hypothesis function (h0, commonly written h_θ) approximates the output for a given input. A linear regression model can represent either a univariate or a multivariate problem.
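As a rough illustration of the hypothesis function and of the univariate and multivariate cases, here is a minimal NumPy sketch. The function names (`hypothesis`, `fit_least_squares`), the toy data, and the choice of an ordinary least-squares fit are assumptions made for this example, not something taken from the article.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so theta[0] acts as the intercept."""
    return np.column_stack([np.ones(len(X)), X])

def hypothesis(theta, X):
    """Linear hypothesis: theta_0 + theta_1 * x_1 + ... for each row of X."""
    return add_intercept(X) @ theta

def fit_least_squares(X, y):
    """Estimate theta by ordinary least squares (one possible fitting method)."""
    theta, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return theta

# Univariate toy problem: y = 1 + 2x, a straight line in the plane.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * x[:, 0]
theta_uni = fit_least_squares(x, y)

# Multivariate toy problem: y = 0.5 + 1.5*x1 - 1.0*x2, a plane.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y_multi = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
theta_multi = fit_least_squares(X, y_multi)

print(theta_uni, theta_multi)
print(hypothesis(theta_uni, np.array([[4.0]])))
```

Because the toy targets are noise-free, the fitted parameters recover the coefficients used to generate them; with real data the hypothesis only approximates the output.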
Machine Learning: Classification
In our case study on analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information, ...). In our second case study for this course, loan default prediction, you will tackle financial data, and predict when a loan is likely to be risky or safe for the bank. These tasks are examples of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification. In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting.
How to use logistic regression for image classification?
Image classification is the process of assigning images to their appropriate labels or categories. It is mostly done with Convolutional Neural Networks (CNNs), but this article attempts to show that even logistic regression can classify images efficiently, with reduced computation time and without the tedious task of building complex models. Logistic regression is a supervised machine learning algorithm used mainly for binary classification problems, where the outcome is one of two fixed categories. It operates through a sigmoid function, whose output ranges between 0 and 1. Since this article emphasizes using logistic regression for image classification, we use the Hand Sign Digit Classification dataset with two categories of images, showing hand signs of 0 and 1.
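The idea can be sketched in a few lines of NumPy: flatten each image into a feature vector, pass a linear score through the sigmoid, and threshold at 0.5. Everything below is an assumption made for illustration: the synthetic 8x8 "images" stand in for the hand-sign dataset, which is not loaded here, and the plain gradient-descent trainer with its learning rate and step count is not the article's code.

```python
import numpy as np

def sigmoid(z):
    """Map any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, steps=2000):
    """Fit weights and bias by gradient descent on the mean log loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)        # predicted probability of class 1
        w -= lr * (X.T @ (p - y)) / n
        b -= lr * np.mean(p - y)
    return w, b

def predict(X, w, b):
    """Label an image 1 when its predicted probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Synthetic stand-in data (hypothetical): dark images are class 0,
# bright images are class 1.
rng = np.random.default_rng(0)
n_per_class = 50
dark = rng.uniform(0.0, 0.4, size=(n_per_class, 8, 8))
bright = rng.uniform(0.6, 1.0, size=(n_per_class, 8, 8))
images = np.concatenate([dark, bright])
labels = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Flatten each 8x8 image into a 64-dimensional feature vector.
X = images.reshape(len(images), -1)
w, b = train_logistic(X, labels)
accuracy = np.mean(predict(X, w, b) == labels)
print(f"training accuracy: {accuracy:.2f}")
```

Real images are far less cleanly separable than this toy data, so accuracy on an actual dataset should be measured on held-out images rather than on the training set.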
Where Do Loss Functions Come From?
We all know that in Linear Regression we aim to minimise the Sum of Squares Error (SSE) as our objective. However, why is it the SSE and where does this expression even come from? In this article I hope to answer this question using something called the Maximum Likelihood Estimator. If you have spent enough time in the Data Science community, I am confident you have come across the term Maximum Likelihood Estimator (MLE). I am not going to give a highly detailed analysis of MLE, primarily because it has been done so many times, in ways that are probably better than I could ever explain it.
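One standard route from MLE to the SSE can be sketched as follows. It assumes that the targets equal a linear function of the inputs plus independent Gaussian noise with a fixed variance; that noise model and the symbols $\theta$, $x_i$, $y_i$, $\sigma^2$ and $n$ are assumptions introduced for this sketch and are not stated in the excerpt above.

```latex
% Assumed model: y_i = \theta^\top x_i + \varepsilon_i, with
% \varepsilon_i \sim \mathcal{N}(0, \sigma^2) independent across i = 1, \dots, n.
\begin{align}
L(\theta) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
             \exp\!\left( -\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} \right) \\
\log L(\theta) &= -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
                  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^\top x_i)^2 \\
\hat{\theta}_{\mathrm{MLE}} &= \arg\max_{\theta} \log L(\theta)
                             = \arg\min_{\theta} \sum_{i=1}^{n} (y_i - \theta^\top x_i)^2
\end{align}
```

The first term of the log-likelihood does not depend on $\theta$ and the factor $1/(2\sigma^2)$ is a positive constant, so under this noise assumption maximising the likelihood is the same as minimising the SSE.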
Which models are interpretable?
Data Scientists have the role of extracting information from raw data. They aren't engineers, nor are they software developers. They dig inside data and extract the gold from the mine. Knowing what a model does and how it works is part of this job. Black-box models, although they sometimes work better than other models, aren't a good idea if we need to learn something from our data.
A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning
Chase, Randy J., Harrison, David R., Burke, Amanda, Lackmann, Gary M., McGovern, Amy
Recently, the use of machine learning in meteorology has increased greatly. While many machine learning methods are not new, university classes on machine learning are largely unavailable to meteorology students and are not required to become a meteorologist. The lack of formal instruction has contributed to the perception that machine learning methods are 'black boxes' and thus end-users are hesitant to apply the machine learning methods in their everyday workflow. To reduce the opaqueness of machine learning methods and lower hesitancy towards machine learning in meteorology, this paper provides a survey of some of the most common machine learning methods. A familiar meteorological example is used to contextualize the machine learning methods while also discussing machine learning topics using plain language. The following machine learning methods are demonstrated: linear regression; logistic regression; decision trees; random forest; gradient boosted decision trees; naive Bayes; and support vector machines. Beyond discussing the different methods, the paper also contains discussions on the general machine learning process as well as best practices to enable readers to apply machine learning to their own datasets. Furthermore, all code (in the form of Jupyter notebooks and Google Colaboratory notebooks) used to make the examples in the paper is provided in an effort to catalyse the use of machine learning in meteorology.
Provably Auditing Ordinary Least Squares in Low Dimensions
Measuring the stability of conclusions derived from Ordinary Least Squares linear regression is critically important, but most metrics either only measure local stability (i.e. against infinitesimal changes in the data), or are only interpretable under statistical assumptions. Recent work proposes a simple, global, finite-sample stability metric: the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion, specifically meaning that the sign of a particular coefficient of the estimated regressor changes. However, besides the trivial exponential-time algorithm, the only approach for computing this metric is a greedy heuristic that lacks provable guarantees under reasonable, verifiable assumptions; the heuristic provides a loose upper bound on the stability and also cannot certify lower bounds on it. We show that in the low-dimensional regime where the number of covariates is a constant but the number of samples is large, there are efficient algorithms for provably estimating (a fractional version of) this metric. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where we can estimate the stability up to a factor of $3$ better than the greedy heuristic, and analyses where we can certify stability to dropping even a majority of the samples.
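To make the stability metric concrete, the sketch below implements only the trivial exponential-time search mentioned in the abstract: it tries every removal set of size 1, 2, ... until the sign of the chosen coefficient flips. It is not the paper's algorithm, and the function names and the five-point toy dataset are hypothetical, chosen so the search stays tiny.

```python
from itertools import combinations

import numpy as np

def ols_coefficient(X, y, j):
    """Return the j-th coefficient of the least-squares fit of y on X."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta[j]

def min_removals_to_flip_sign(X, y, j):
    """Brute-force search for the smallest set of samples whose removal
    flips the sign of coefficient j. Exponential in the number of samples,
    so usable only on very small datasets."""
    n, d = X.shape
    base_sign = np.sign(ols_coefficient(X, y, j))
    all_idx = np.arange(n)
    for k in range(1, n - d + 1):          # keep at least d samples
        for removed in combinations(range(n), k):
            keep = np.setdiff1d(all_idx, removed)
            if np.sign(ols_coefficient(X[keep], y[keep], j)) == -base_sign:
                return k, removed
    return None                            # no removal set flips the sign

# Toy data (hypothetical): four points on a gently falling line plus one
# high-leverage point that makes the overall slope positive.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0.0, -0.1, -0.2, -0.3, 5.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column, then slope column

result = min_removals_to_flip_sign(X, y, j=1)
print(result)
```

In this toy dataset the positive slope rests entirely on the single high-leverage point, so removing that one sample is enough to overturn the conclusion. The number of candidate subsets grows exponentially with the sample count, which is the cost the abstract's efficient algorithms are meant to avoid.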