ML Development
From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
Li, Zhiwei, Kesselman, Carl, Nguyen, Tran Huy, Xu, Benjamin Yixing, Bolo, Kyle, Yu, Kimberley
--Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.

As machine learning (ML) becomes increasingly central to scientific discovery, concerns about correctness and reproducibility have grown [1]. In eScience, ML development is typically a collaborative and iterative process involving domain experts, data engineers, and ML researchers. These teams refine models based on evolving hypotheses and new data, creating feedback loops across data curation, feature engineering, modeling, and evaluation [2]. This dynamic process frequently introduces data cascades, where early curation errors propagate downstream, compounding over time [3]. In practice, ML workflows remain fragmented: datasets are shared informally, experiments span personal and cloud environments, and data, code, and configurations are often loosely coupled [4].
While MLOps and data management tools address parts of this problem, such as code versioning, pipeline orchestration, or environment encapsulation, they often overlook the full scientific lifecycle and the socio-technical realities of collaborative ML projects [5]. In prior work, we introduced Deriva-ML [6], a socio-technical platform that extends the FAIR principles (Findable, Accessible, Interoperable, Reusable) [7] across the ML development lifecycle.
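The six artifact types named in the abstract can be pictured as linked records, where an Execution ties a versioned Workflow to its input Datasets and output Assets. The sketch below is a hypothetical illustration of that linkage; the field layouts are our own assumptions, not the actual Deriva-ML schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of the six artifact types; field names are assumptions.

@dataclass
class ControlledVocabulary:
    name: str
    terms: List[str]          # the allowed terms, e.g. diagnosis labels

@dataclass
class Dataset:
    rid: str                  # resource identifier
    version: str
    members: List[str] = field(default_factory=list)

@dataclass
class Feature:
    name: str
    source_dataset: str       # rid of the Dataset it was derived from
    vocabulary: str           # ControlledVocabulary constraining its values

@dataclass
class Asset:
    rid: str
    url: str                  # location of the file (model weights, plots, ...)

@dataclass
class Workflow:
    rid: str
    code_url: str             # versioned code defining the computation

@dataclass
class Execution:
    rid: str
    workflow: str             # rid of the Workflow that was run
    inputs: List[str]         # rids of input Datasets/Assets
    outputs: List[str]        # rids of produced Assets

def provenance_chain(execution: Execution) -> dict:
    """Summarize how an Execution's outputs trace back to inputs and code."""
    return {"workflow": execution.workflow,
            "inputs": execution.inputs,
            "outputs": execution.outputs}
```

Because every record points at the others by identifier, replaying or auditing an experiment reduces to walking these links rather than reconstructing it from scattered scripts.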
- North America > United States > California > Los Angeles County > Los Angeles (0.15)
- North America > United States > California > Monterey County > Marina (0.04)
- Research Report (1.00)
- Workflow (0.83)
- Information Technology > Information Management (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.34)
MLScent: A Tool for Anti-pattern Detection in ML Projects
Shivashankar, Karthik, Martini, Antonio
--Machine learning (ML) codebases face unprecedented challenges in maintaining code quality and sustainability as their complexity grows. While traditional code smell detection tools exist, they fail to address ML-specific issues that can significantly impact model performance, reproducibility, and maintainability. This paper introduces MLScent, a novel static analysis tool that leverages sophisticated Abstract Syntax Tree (AST) analysis to detect anti-patterns and code smells specific to ML projects. MLScent implements 76 distinct detectors across major ML frameworks including TensorFlow (13 detectors), PyTorch (12 detectors), Scikit-learn (9 detectors), and Hugging Face (10 detectors), along with data science libraries like Pandas and NumPy (8 detectors each). Our evaluation demonstrates MLScent's effectiveness through both quantitative classification metrics and qualitative assessment via user-study feedback from ML practitioners. Results show high accuracy in identifying framework-specific anti-patterns, data handling issues, and general ML code smells across real-world projects.

The software development landscape has undergone a dramatic transformation with the integration of Machine Learning (ML). Recent statistics from Gartner highlight this shift, revealing a striking 270% increase in ML adoption within enterprise software projects over the last four years [1]. This rapid adoption, however, brings its own set of complexities. Traditional software development practices have had to evolve significantly to accommodate ML's unique requirements, including the need for extensive datasets, sophisticated algorithms, and iterative development cycles [3]. These fundamental differences have catalyzed a complete reimagining of software development methodologies, from initial design through testing and maintenance [4], [5], which is also highlighted by Tang et al. [6] in their empirical study of ML systems refactoring and technical debt.
ML projects introduce distinct code quality challenges that set them apart from conventional software development. The complexity stems from their inherent characteristics: intricate mathematical operations, extensive data preprocessing requirements, and sophisticated model architectures that challenge traditional code maintenance approaches [7].
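To make the AST-based detection idea concrete, here is a minimal sketch of a single detector in the style the abstract describes: it flags calls to scikit-learn's `train_test_split` that omit `random_state`, a classic reproducibility smell. This detector is our own illustration, not one of MLScent's 76 actual detectors.

```python
import ast

def find_unseeded_splits(source: str) -> list:
    """Return line numbers where train_test_split is called without random_state."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both bare names and attribute access (module.func style).
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name == "train_test_split":
                kwargs = {kw.arg for kw in node.keywords}
                if "random_state" not in kwargs:
                    smells.append(node.lineno)
    return smells

code = """from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
"""
print(find_unseeded_splits(code))  # → [2]: the unseeded call is on line 2
```

Because the check runs on the syntax tree rather than on text, it is robust to formatting and catches the pattern regardless of how the call is laid out.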
- North America > United States (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.88)
TAACKIT: Track Annotation and Analytics with Continuous Knowledge Integration Tool
Lee, Lily, Fontes, Julian, Weinert, Andrew, Schomacker, Laura, Stabile, Daniel, Hou, Jonathan
Machine learning (ML) is a powerful tool for efficiently analyzing data, detecting patterns, and forecasting trends across various domains such as text, audio, and images. The availability of annotation tools to generate reliably annotated data is crucial for advances in ML applications. In the domain of geospatial tracks, the lack of such tools to annotate and validate data impedes rapid and accessible ML application development. This paper presents Track Annotation and Analytics with Continuous Knowledge Integration Tool (TAACKIT) to serve the critically important functions of annotating geospatial track data and validating ML models. We demonstrate an ML application use case in the air traffic domain to illustrate its data annotation and model evaluation power and quantify the annotation effort reduction.
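Annotating a geospatial track typically means attaching a label to a time window of the trajectory. The sketch below is a hypothetical illustration of that pattern in the spirit of TAACKIT; the field names and selection logic are our assumptions, not the tool's actual data model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical track/annotation records; field names are illustrative only.

@dataclass
class TrackPoint:
    t: float        # seconds since track start
    lat: float
    lon: float
    alt_ft: float

@dataclass
class Annotation:
    label: str      # e.g. "level", "climb", "descent"
    t_start: float
    t_end: float

def points_in_annotation(track: List[TrackPoint], ann: Annotation) -> List[TrackPoint]:
    """Select the track points covered by an annotation's time window."""
    return [p for p in track if ann.t_start <= p.t <= ann.t_end]

# A toy track climbing 100 ft per second, annotated over t = 2..6.
track = [TrackPoint(t, 42.0, -71.0, 5000 + 100 * t) for t in range(10)]
climb = Annotation("climb", 2.0, 6.0)
print(len(points_in_annotation(track, climb)))  # → 5 points fall in the window
```

Labeled windows like these can then serve double duty: as training data for an ML model and as ground truth when validating its predictions on held-out tracks.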
- North America > United States > Massachusetts > Middlesex County > Lexington (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
What Makes An Expert? Reviewing How ML Researchers Define "Expert"
Human experts are often engaged in the development of machine learning systems to collect and validate data, consult on algorithm development, and evaluate system performance. At the same time, who counts as an 'expert' and what constitutes 'expertise' is not always explicitly defined. In this work, we review 112 academic publications that explicitly reference 'expert' and 'expertise' and that describe the development of machine learning (ML) systems to survey how expertise is characterized and the role experts play. We find that expertise is often undefined and forms of knowledge outside of formal education and professional certification are rarely sought, which has implications for the kinds of knowledge that are recognized and legitimized in ML development. Moreover, we find that expert knowledge tends to be utilized in ways focused on mining textbook knowledge, such as through data annotation. We discuss the ways experts are engaged in ML development in relation to deskilling, the social construction of expertise, and implications for responsible AI development. We point to a need for reflection and specificity in justifications of domain expert engagement, both as a matter of documentation and reproducibility, as well as a matter of broadening the range of recognized expertise.
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Massachusetts > Middlesex County > Waltham (0.04)
- (3 more...)
- Overview (0.87)
- Research Report (0.82)
- Health & Medicine (1.00)
- Education (1.00)
Service in review: Sagemaker Modeling Pipelines - DEV Community
Welcome back to my blog, where I share insights and tips on machine learning workflows using SageMaker Pipelines. If you're new here, I recommend checking out my first post to learn more about this AWS fully managed machine learning service. In my second post, I discussed how parameterization can help you customize a workflow and make it more flexible and efficient. After using SageMaker Pipelines extensively in real-life projects, I've gained a comprehensive understanding of the service. In this post, I'll summarize the key benefits of SageMaker Pipelines and the limitations you should consider before adopting it. Because the service is integrated directly with SageMaker, users don't have to wire together other AWS services themselves.
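The parameterization pattern the post refers to boils down to declaring named parameters with defaults and overriding them per run. The real SDK exposes this via `sagemaker.workflow.parameters` (e.g. `ParameterString`); the standalone sketch below mimics only the pattern, with no AWS dependency, so the names here are illustrative stand-ins rather than the SageMaker API.

```python
# Minimal analogue of pipeline parameterization; not the SageMaker SDK.

class Parameter:
    def __init__(self, name, default):
        self.name, self.default = name, default

class Pipeline:
    def __init__(self, name, parameters, steps):
        self.name = name
        self.parameters = {p.name: p.default for p in parameters}
        self.steps = steps

    def start(self, **overrides):
        """Resolve parameters (defaults plus per-run overrides), then run steps."""
        resolved = {**self.parameters, **overrides}
        return [step(resolved) for step in self.steps]

instance_type = Parameter("TrainingInstanceType", "ml.m5.xlarge")
pipeline = Pipeline(
    "demo",
    parameters=[instance_type],
    steps=[lambda p: f"train on {p['TrainingInstanceType']}"],
)
print(pipeline.start())                                      # → ['train on ml.m5.xlarge']
print(pipeline.start(TrainingInstanceType="ml.p3.2xlarge"))  # → ['train on ml.p3.2xlarge']
```

The payoff is the same as in the managed service: one pipeline definition serves dev and production runs by swapping parameter values at start time instead of editing the workflow.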
Does your business need AI/ML Development?
AI/ML is an innovative technology that many see as the next wave: it can analyze heaps of data (structured as well as unstructured) and make intelligent decisions without needing human intervention. AI/ML development companies can help you increase productivity and minimize operational inefficiencies across several facets of your business. Many businesses can leverage AI for complex use cases and to support business development lifecycles, from customer support to marketing and lead generation. Here are some areas where hiring an AI/ML development company can prove fruitful for your business. Digital channels are a crucial part of any marketing program.
Introducing the Private Hub: A New Way to Build With Machine Learning
Machine learning is changing how companies are building technology. From powering a new generation of disruptive products to enabling smarter features in well-known applications we all use and love, ML is at the core of the development process. But with every technology shift comes new challenges. Around 90% of machine learning models never make it into production. Efforts get duplicated as models and datasets aren't shared internally, and similar artifacts are built from scratch across teams all the time.
How to find the business value in AI and ML
There's no doubt that, when applied effectively, machine learning (ML) and artificial intelligence (AI) have proven potential to deliver significant value and cutting-edge technological innovation. But many organizations are struggling with the "effectively" part, according to a new survey. Despite the fact that businesses are increasingly undertaking initiatives to leverage ML and AI, many tools and projects lack appropriate resources, are far less productive than they should be, lag in deployment, and more often than not fail or are abandoned. In short, business value is rarely captured, and very often falls short of expectations, because significant time, resources, and budgets are being wasted, according to a 2021 survey of ML practitioners, "Too Much Friction, Too Little ML." "Building AI is hard," said Gideon Mendels, CEO and cofounder of Comet, the enterprise ML development platform company that commissioned the survey. "ML is often a slow, iterative process with many potential pitfalls and moving parts."
Top Emerging Machine Learning Trends For 2022
Machine learning creates algorithms that help machines better comprehend data and make data-driven judgments. According to some observers, machine learning will become quite widespread by 2024, with the most emphasis in 2022 and 2023. Machine learning (ML) applications can be found in a variety of industries, including banks, restaurants, industrial plants, and even gas stations. The first and most important ML development is in IoT, which the majority of computing professionals are looking forward to. As a cornerstone of IoT, a breakthrough in this area will significantly impact 5G adoption.
- Information Technology (0.52)
- Transportation > Ground > Road (0.37)
How to scale AI with a high degree of customization
In a previous post, I outlined four challenges to scaling AI: customization, data, talent, and trust. In this post, I'm going to dig deeper into that first challenge of customization. Scaling machine learning programs is very different from scaling traditional software, because ML programs have to be adapted to fit each new problem you approach. As the data you're using changes (whether because you're attacking a new problem or simply because time has passed), you will likely need to build and train new models. This takes human input and supervision. The degree of supervision varies, and that is critical to understanding the scalability challenge.