At the R User conference, I spoke on digital provenance, the importance of reproducible research, and how Domino has addressed many of the challenges data scientists face when attempting this best practice. What are you doing to mitigate the many risks associated with provenance (or the lack of it)? Reproducibility matters throughout the entire data science process: as recent studies have shown, subconscious biases in the exploratory analysis phase of a project can have vast repercussions on final conclusions. The problems of managing the deployment and life cycle of models in production are vast and varied, and reproducibility often stops at the level of the individual analyst.
Science and engineering are information-intensive, collaborative enterprises with their own complex and rapidly evolving processes for creating, discovering, and analyzing digital artifacts. To support multiple contexts (Virtual Organizations), infrastructure must provide general-purpose mechanisms for reasoning about processes and data. This paper considers one important aspect of this problem: recording and reasoning about provenance. The Open Provenance Model (OPM) provides a case study for how to use Semantic Web technology and rules to implement semantic metadata. This paper discusses a binding of the OPM written in OWL, with rules written in SWRL. These standards are useful, but cannot implement the entire OPM. However, the use of RDF enables the development of "hybrid" systems that use OWL, SWRL, and other semantic software, interoperating through a shared space of RDF triples.
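The kind of rule-based reasoning described above can be sketched without a full OWL/SWRL stack. The following is a minimal illustration, using plain Python sets in place of an RDF store or reasoner: it encodes a few OPM-style causal edges as (subject, predicate, object) triples and applies one derivation rule with a transitive closure, the kind of multi-step inference such rules target. The edge names follow OPM vocabulary, but the specific artifacts and processes are hypothetical.

```python
# Minimal sketch of OPM-style provenance triples and one inference rule.
# Plain Python sets stand in for an RDF store; no OWL/SWRL reasoner is used.

# OPM causal edges as (subject, predicate, object) triples; the artifact
# and process names here are illustrative, not from any real workflow.
triples = {
    ("plot.png",  "wasGeneratedBy", "render"),
    ("render",    "used",           "clean.csv"),
    ("clean.csv", "wasGeneratedBy", "cleanup"),
    ("cleanup",   "used",           "raw.csv"),
}

def derived_from(triples):
    """One-step rule: if artifact A wasGeneratedBy process P and P used
    artifact B, infer (A, wasDerivedFrom*, B); then close transitively."""
    step = {(a, "wasDerivedFrom*", b)
            for (a, p1, proc) in triples if p1 == "wasGeneratedBy"
            for (proc2, p2, b) in triples if p2 == "used" and proc2 == proc}
    closed = set(step)
    changed = True
    while changed:  # naive transitive closure over the inferred edges
        changed = False
        for (a, _, b) in list(closed):
            for (b2, _, c) in list(closed):
                if b2 == b and (a, "wasDerivedFrom*", c) not in closed:
                    closed.add((a, "wasDerivedFrom*", c))
                    changed = True
    return closed

inferred = derived_from(triples)
print(("plot.png", "wasDerivedFrom*", "raw.csv") in inferred)  # True
```

In a hybrid system of the kind the paper describes, the triples would live in a shared RDF space and the closure would be computed by a rule engine rather than ad hoc code; the sketch only shows the shape of the inference.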
Conceptualized in 1813 and later formalized in 1962, the Federal Depository Library Program made US federal publications freely accessible by allowing agencies to send documents postage-free to libraries across the country for deposit. It was perhaps one of the government's earliest attempts at an "open" framework that provided information to the public. While the push for open data is not new, there remains a lack of consensus on how best to leverage secondary use of the data. To aid this endeavor, the research community has recently adopted the "FAIR" data principles, pledging to make data "Findable, Accessible, Interoperable, and Reusable." But what does it mean to make data FAIR, and how can open data be used to its utmost potential?
The British company Provenance says it is lighting a fire under the retail world. The company has developed an app that allows retailers and customers to see where a product comes from - from origin to point of sale. "Behind every product is a complex chain of people and places, and that's a really important part of why people buy things," founder Jessi Baker explains. "Provenance is all about making that information transparent to shoppers, but also to businesses, all along the supply chain."
The usefulness of intelligent applications and services reasoning with linked data depends on the availability and correctness of that data. The crowd potentially has an important role to play in performing the non-trivial tasks of creating, validating, and maintaining the online linked data sets used by applications and services. Additional information captured within a provenance record can be used in these tasks and others, such as evaluating the performance of the crowd and its members. In this paper we describe two roles for the crowd in the web of linked data (creation and maintenance), and argue that incorporating provenance into these tasks is beneficial, especially in scenarios where the population of available workers is small. We also identify several challenges for the use of provenance in this context and define a set of requirements for a provenance model to address these challenges.
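One of the uses named above, evaluating crowd members from provenance, can be sketched concretely. The snippet below is an illustrative example, not the paper's model: each crowd contribution carries a PROV-style attribution field, and once some statements have been independently validated, per-worker accuracy falls out of the attribution records. All triples, worker ids, and field names are hypothetical.

```python
# Hypothetical sketch: provenance-attributed crowd edits to a linked-data
# set, scored per worker once some statements have been validated.
from collections import defaultdict

# Each record: the asserted triple, who contributed it, and the activity
# (creation vs. maintenance, the two crowd roles discussed in the paper).
records = [
    {"triple": ("Berlin", "capitalOf", "Germany"), "wasAttributedTo": "w1", "activity": "create"},
    {"triple": ("Paris",  "capitalOf", "Germany"), "wasAttributedTo": "w2", "activity": "create"},
    {"triple": ("Paris",  "capitalOf", "France"),  "wasAttributedTo": "w1", "activity": "maintain"},
]

# Validation outcomes for the subset of triples checked so far.
validated = {
    ("Berlin", "capitalOf", "Germany"): True,
    ("Paris",  "capitalOf", "Germany"): False,
    ("Paris",  "capitalOf", "France"):  True,
}

def worker_accuracy(records, validated):
    """Use provenance attribution to compute each worker's accuracy
    over the statements that have been validated so far."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        outcome = validated.get(r["triple"])
        if outcome is None:
            continue  # not yet validated; provenance lets us score it later
        w = r["wasAttributedTo"]
        total[w] += 1
        correct[w] += outcome
    return {w: correct[w] / total[w] for w in total}

print(worker_accuracy(records, validated))  # {'w1': 1.0, 'w2': 0.0}
```

With few workers, as in the small-population scenario the paper highlights, even a handful of validated contributions per worker gives a usable signal for routing or weighting future tasks.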