JEL: Applying End-to-End Neural Entity Linking in JPMorgan Chase
Ding, Wanying, Chaudhri, Vinay K., Chittar, Naren, Konakanchi, Krishna
Knowledge Graphs have emerged as a compelling abstraction for capturing key relationships among the entities of interest to enterprises and for integrating data from heterogeneous sources. JPMorgan Chase (JPMC) is leading this trend by leveraging knowledge graphs across the organization for multiple mission-critical applications such as risk assessment, fraud detection, and investment advice. A core problem in leveraging a knowledge graph is to link mentions (e.g., company names) encountered in textual sources to entities in the knowledge graph. Although several techniques exist for entity linking, they are tuned for entities that exist in Wikipedia and fail to generalize to the entities that are of interest to an enterprise. In this paper, we propose a novel end-to-end neural entity linking model (JEL) that uses minimal context information and a margin loss to generate entity embeddings, and a Wide & Deep Learning model to match character and semantic information respectively. We show that JEL achieves state-of-the-art performance in linking mentions of company names in financial news with entities in our knowledge graph. We report on our efforts to deploy this model in a company-wide system that generates alerts in response to financial news. The methodology used for JEL is directly applicable to other enterprises that need entity linking solutions for data unique to their respective situations.
- Banking & Finance (1.00)
- Information Technology > Security & Privacy (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
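To make the margin-loss objective in the JEL abstract above concrete, here is a minimal sketch of a margin loss over mention and entity embeddings, written in PyTorch. The encoders, embedding dimension, and margin value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a margin loss for entity-embedding training, in the
# spirit of the JEL abstract above. Dimensions and the margin value are
# illustrative assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def margin_loss(mention_emb, pos_entity_emb, neg_entity_emb, margin=0.5):
    """Push each mention closer to its true entity than to a negative one."""
    pos_score = F.cosine_similarity(mention_emb, pos_entity_emb)
    neg_score = F.cosine_similarity(mention_emb, neg_entity_emb)
    # Hinge: the loss is zero once the positive entity beats the
    # negative entity by at least `margin`.
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()

# Example: a batch of 4 mentions with 128-dimensional embeddings.
mentions = torch.randn(4, 128)
positives = torch.randn(4, 128)
negatives = torch.randn(4, 128)
print(margin_loss(mentions, positives, negatives))
```

In a full pipeline the embeddings would come from the mention and entity encoders, and the negatives from negative sampling over the knowledge graph; both are beyond this sketch.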
Statistical inference for case-control logistic regression via integrating external summary data
Shi, Hengchao, Liu, Xinyi, Zheng, Ming, Yu, Wen
Case-control sampling is a commonly used retrospective sampling design for alleviating the imbalanced structure of binary data. When fitting a logistic regression model with case-control data, the slope parameter can be consistently estimated, but the intercept parameter is not identifiable, and the marginal case proportion is not estimable either. We consider situations in which, besides the case-control data from the main study (called the internal study), there also exists summary-level information from related external studies. An empirical likelihood based approach is proposed for inference on the logistic model that incorporates the internal case-control data and the external information. We show that the intercept parameter becomes identifiable with the help of external information, and then all the regression parameters, as well as the marginal case proportion, can be estimated consistently. The proposed method also accounts for possible variability in the external studies. The resulting estimators are shown to be asymptotically normally distributed, and the asymptotic variance-covariance matrix can be consistently estimated from the case-control data. The optimal way to utilize external information is discussed. Simulation studies are conducted to verify the theoretical findings, and a real data set is analyzed for illustration.
- Research Report > Experimental Study (0.85)
- Research Report > New Finding (0.85)
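To make the identifiability issue above concrete, the following is the standard case-control identity (a textbook fact, not the paper's empirical-likelihood construction): retrospective sampling shifts only the intercept of the logistic model, and an external value of the marginal case proportion pins the shift down.

```latex
% S = 1 indicates inclusion in the case-control sample;
% p_1 = P(S=1 | Y=1) and p_0 = P(S=1 | Y=0) are the sampling rates.
\[
  \operatorname{logit} P(Y=1 \mid X=x) = \alpha + \beta^{\top} x
  \quad\Longrightarrow\quad
  \operatorname{logit} P(Y=1 \mid X=x,\, S=1)
    = \underbrace{\alpha + \log\frac{p_1}{p_0}}_{\alpha^{*}} + \beta^{\top} x .
\]
% The retrospective fit recovers (\alpha^{*}, \beta): the slope is consistent,
% but \alpha is confounded with the unknown ratio p_1/p_0. Given the marginal
% case proportion \pi = P(Y=1) from an external study, and case/control
% counts n_1 and n_0,
\[
  \frac{p_1}{p_0} \approx \frac{n_1}{n_0} \cdot \frac{1-\pi}{\pi},
  \qquad \text{so} \qquad
  \alpha = \alpha^{*} - \log\frac{p_1}{p_0}
\]
% becomes identifiable.
```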
A Taxonomy of Foundation Model based Systems for Responsible-AI-by-Design
Lu, Qinghua, Zhu, Liming, Xu, Xiwei, Xing, Zhenchang, Whittle, Jon
The recent release of large language model (LLM) based chatbots, such as ChatGPT, has drawn significant attention to foundation models. It is widely believed that foundation models will serve as the fundamental building blocks for future AI systems. Because foundation models are in their early stages, the design of foundation model based systems has not yet been systematically explored, and little is understood about the impact of introducing foundation models into software architecture. Therefore, in this paper, we propose a taxonomy of foundation model based systems, which classifies and compares the characteristics of foundation models and the design options of foundation model based systems. Our taxonomy comprises three categories: foundation model pretraining and fine-tuning, architecture design of foundation model based systems, and responsible-AI-by-design. The taxonomy provides concrete guidance for making major design decisions when designing foundation model based systems and highlights the trade-offs arising from those decisions.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.47)
Data-Driven Joint Inversions for PDE Models
The task of simultaneously reconstructing multiple physical coefficients in partial differential equations from observed data is ubiquitous in applications. In this work, we propose an integrated data-driven and model-based iterative reconstruction framework for such joint inversion problems, in which additional data on the unknown coefficients are supplemented for better reconstructions. Our method couples the supplementary data with the PDE model to make the data-driven modeling process consistent with the model-based reconstruction procedure. We characterize the impact of learning uncertainty on the joint inversion results for two typical model inverse problems. Numerical evidence is provided to demonstrate the feasibility of using data-driven models to improve the joint inversion of physical models.
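As one generic way to read "coupling the supplementary data with the PDE model", the joint inversion can be written as a single penalized objective. This form is an illustrative assumption, not the paper's exact formulation.

```latex
% F: PDE forward map, d: observed data, (m_1, m_2): unknown coefficients,
% \widehat{m}_1, \widehat{m}_2: data-driven (learned) coefficient estimates,
% \lambda_1, \lambda_2: coupling weights, which would be chosen to reflect
% the learning uncertainty discussed in the abstract.
\[
  \min_{m_1,\, m_2} \;
    \tfrac{1}{2} \bigl\| F(m_1, m_2) - d \bigr\|^{2}
    + \tfrac{\lambda_1}{2} \bigl\| m_1 - \widehat{m}_1 \bigr\|^{2}
    + \tfrac{\lambda_2}{2} \bigl\| m_2 - \widehat{m}_2 \bigr\|^{2} .
\]
```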
4 Ways Alternative Data Is Improving Fintech Companies in APAC - Fintech Hong Kong
Various categories of fintech firms – Buy Now, Pay Later (BNPL), digital lending, payments and collections – are increasingly leveraging predictive models built using artificial intelligence and machine learning to support core business functions such as risk decisioning. According to a report by Grand View Research, Inc., the global AI in fintech market size is expected to reach US$41.16 billion by 2030, growing at a compound annual growth rate (CAGR) of 19.7% in Asia-Pacific alone from 2022 to 2030. The success of AI in fintech, or any business for that matter, hinges on an organisation's ability to make accurate predictions based on data. While internal data (first-party data) needs to be factored into AI models, this data often fails to capture critical predictive features, causing these models to underperform. In these situations, alternative data and feature enrichment can establish a powerful advantage.
- Asia > China > Hong Kong (0.40)
- South America (0.05)
- North America > Central America (0.05)
- Asia > Southeast Asia (0.05)
Save Sarah Connor with Data Science - KDnuggets
Data science and data privacy are deeply interwoven and must be carefully considered by practitioners. Comparing the Safe Harbour and Expert Determination data obfuscation approaches: Safe Harbour has been very popular among data engineers but has fundamental limitations, whereas Expert Determination offers important advantages.
- North America > United States > California (0.08)
- North America > United States > Vermont (0.05)
- North America > United States > Maine (0.05)
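One way to see why Safe Harbour is popular yet limited: it is a fixed checklist of identifier removals that any engineer can implement mechanically, with no analysis of residual re-identification risk, which is exactly what Expert Determination adds. The toy Python sketch below illustrates that checklist flavor; the field names and rules are simplified stand-ins for the 18 HIPAA Safe Harbor identifier categories.

```python
# Toy illustration of rule-based, Safe Harbour style de-identification.
# Field names and rules are simplified assumptions, not the full standard.
def safe_harbour_redact(record: dict) -> dict:
    out = dict(record)
    out["name"] = "[REDACTED]"                   # direct identifiers removed
    out["zip"] = record["zip"][:3] + "XX"        # geography coarsened
    out["birth_date"] = record["birth_date"][:4] # keep the year only
    if record["age"] > 89:                       # very old ages aggregated
        out["age"] = "90+"
    return out

print(safe_harbour_redact(
    {"name": "Sarah Connor", "zip": "90001",
     "birth_date": "1965-11-13", "age": 59}
))
```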
Tinder reveals most popular trends, songs in its 2020 'Year in Swipe'
While the coronavirus pandemic has hindered a fair amount of in-person dates, Tinder users have kept on swiping, according to the app's newly released "Year in Swipe 2020" report. In its data-backed findings, Tinder suggests that Gen Z "never stopped dating" and instead discovered creative ways to stay connected with their potential matches, which included updating their profile bios and sending direct messages.
In 2021, off-the-shelf datasets will be on the rise for AI model development
If there's one thing that companies large and small can agree on, it's that deploying effective artificial intelligence (AI) is challenging. Not every organization has the funds, specialized teams, and annotators required for a large-scale AI deployment, and even those that do struggle with collecting enough high-quality data to build accurate models quickly, or updating them with the right frequency. Deploying and maintaining AI with speed is essential for a competitive advantage in this rapidly evolving space, which is why many companies are looking to third-party options that enable them to scale quickly. In particular, organizations are increasingly relying on off-the-shelf, or pre-built, datasets to provide needed data conveniently and with limited risk. These datasets are cost-effective alternatives that can accelerate deployments and provide the last percentage point or two of accuracy required to meet desired confidence thresholds.
Learning a faceted customer segmentation for discovering new business opportunities at Intel
Lieder, Itay, Segal, Meirav, Avidan, Eran, Cohen, Asaf, Hope, Tom
For sales and marketing organizations within large enterprises, identifying and understanding new markets, customers, and partners is a key challenge. Intel's Sales and Marketing Group (SMG) faces these challenges while growing in new markets and domains and evolving its existing business. In today's complex technological and commercial landscape, there is a need for intelligent automation supporting a fine-grained understanding of businesses, in order to help SMG sift through millions of companies across many geographies and languages and identify relevant directions. We present a system developed in our company that mines millions of public business web pages and extracts a faceted customer representation. We focus on two key customer aspects that are essential for finding relevant opportunities: industry segments (ranging from broad verticals such as healthcare to more specific fields such as 'video analytics') and functional roles (e.g., 'manufacturer' or 'retail'). To address the challenge of labeled data collection, we enrich our data with external information gleaned from Wikipedia, and we develop a semi-supervised, multi-label, multilingual deep learning model that parses customer website texts and classifies them into their respective facets. Our system scans and indexes companies as part of a large-scale knowledge graph that currently holds tens of millions of connected entities, with thousands more fetched, enriched, and connected to the graph every hour in real time; the graph also supports knowledge and insight discovery. In experiments conducted in our company, we are able to significantly boost the performance of sales personnel in the task of discovering new customers and commercial partnership opportunities.
- Semiconductors & Electronics (0.40)
- Information Technology > Hardware (0.40)
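As a rough illustration of the multi-label facet classifier described above: each facet gets an independent sigmoid, so a single company page can score high on several industry segments and functional roles at once. The encoder, label set, and decision threshold below are placeholder assumptions, not the system's actual components.

```python
# Minimal sketch of a multi-label facet classifier over website text.
# The label set, encoder dimension, and threshold are illustrative.
import torch
import torch.nn as nn

FACET_LABELS = ["healthcare", "video analytics", "manufacturer", "retail"]

class FacetClassifier(nn.Module):
    def __init__(self, encoder_dim=768, num_labels=len(FACET_LABELS)):
        super().__init__()
        self.head = nn.Linear(encoder_dim, num_labels)

    def forward(self, text_embedding):
        # Independent sigmoids: a page may belong to several facets at
        # once, unlike a softmax, which would force exactly one label.
        # (Training would pair raw logits with BCEWithLogitsLoss.)
        return torch.sigmoid(self.head(text_embedding))

model = FacetClassifier()
embedding = torch.randn(1, 768)  # stand-in for a multilingual text encoder
probs = model(embedding)
predicted = [label for label, p in zip(FACET_LABELS, probs[0]) if p > 0.5]
print(predicted)
```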
How to create a data set for machine learning with limited data
Analysts agree that the more data you have, the better trained your models will be. So how does a data shortage factor in when determining how to create a data set for machine learning? The solution may be to look for data in unique places and to pull from research and prior collection. At the recent AI World Conference & Expo, data scientist Madhu Bhattacharyya, managing director of enterprise data and analytics at global consultancy firm Protiviti, talked about internal data shortages, mitigating bias, and the importance of external data collection. What are some tips for how to create a data set for machine learning if you have limited internal data?