This opening post sets the stage for a comprehensive discussion on defining an open standard for battery cycler output data within the Battery Data Alliance community.
By establishing an open standard for battery cycler output data, we aim to accelerate innovation and drive the energy transition forward.
Let’s engage in a constructive dialogue to shape the future of battery data analysis together.
Goal
By standard we do not mean an opinionated implementation tied to a particular file format or programming language, but rather a set of principles and rules that any given implementation can conform to in order to ensure (full or partial) compatibility across diverse datasets.
This standard must support implementations that ensure FAIR data.
Key Objectives
Standards: The output data from battery cyclers should adhere to a set of rules that ensure (full or partial) compatibility for
a. any battery cycler (that adheres to the standard)
b. irrespective of the programming language used
c. irrespective of how the data is persisted
Use Cases Exploration: By delving into real-world applications and requirements for battery data, we aim to uncover the diverse needs that such a standard can fulfill.
a. what kind of stakeholders would care about the data
b. what requirements are there for analysis, reporting, modelling / ML, …?
Discussion Focus
Primary data: What attributes are universal for primary battery test data, that is, data that is directly measured during the test?
Meta data: What attributes should always be present in order to give the primary data context and value?
This sounds like a great topic to bring up. I propose we begin by considering what standards or specifications already exist and identifying any gaps or overlap. I would love to begin by adopting what someone has already started.
I am actually not aware of any standard that is somehow widely adopted, so I definitely see benefits in creating one.
Battery Ontology focuses on naming and describing terms, so that when a standard is formed, something like capacity or upper voltage cutoff already has a definition. It eliminates the need for making our own definitions when describing, for example, what data is mandatory. It tries to link to standards (IEEE and others) when available and enables a machine- and human-readable description.
Battery Data Genome is mostly a call for action, and projects like DigiBatt (including all European labs of the data genome) are working in that spirit.
A cycler output standard should therefore align with both.
For a cycler output standard to emerge, it would likely require adoption by cycler companies or at least cycler users. As it stands today, IIRC each cycler brand outputs a file in its own format. Has there been any study so far showing how cycler formats line up against one another or against a given ontology? In addition, the labs involved in Battery Data Genome and projects like DigiBatt must have some format they work off of today?
From my last look when working on Battery Archive, the problem seemed to be less in defining a specific format than in getting that format adopted across multiple organizations. How problem specific are these formats? Could a standard definition be a matter of identifying a particular favored cycler brand’s format and consolidating others to it?
@smedegaard given your comments above regarding a standard as not an opinionated implementation, what level of detail are you looking to reach with this effort?
@gabe I was thinking of something as generalized as BattInfo. In my eyes there should be some set of attributes that all cyclers output (primary data). Those should be described with a vocabulary that could very well be brought in from BattInfo.
Then there should be a set of meta data attributes that may need to be defined in a vocabulary that I’m not sure is present in BattInfo. Here I’m thinking of location, staff id, batch number and other data that gives the primary data context.
All of this should, ideally, be so general that it could be implemented with any given programming language (object-oriented, functional or otherwise) and any given method of persisting, whether it’s row- or column-based files; relational, document, wide-column, graph or whatever flavor of database you may want.
The hope is that by approaching battery data management from this angle, the post-test value creation from data would be baked into the process from the start.
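To make the idea concrete, here is a minimal sketch of a persistence-agnostic conformance check. The attribute names (`test_time_s`, `voltage_v`, and so on) are made up for illustration and are not taken from BattInfo, and the meaning of "full" versus "partial" used here is only one possible reading of the goal above.

```python
# Hypothetical attribute names; a real standard would take its vocabulary
# from an ontology such as BattInfo.
REQUIRED_PRIMARY = {"test_time_s", "voltage_v", "current_a"}
OPTIONAL_PRIMARY = {"temperature_c", "cycle_index"}


def conformance(attribute_names):
    """Classify a dataset as 'full', 'partial' or 'none'.

    Assumption: 'partial' means all required attributes are present,
    'full' means the optional ones are present as well.
    """
    present = set(attribute_names)
    if not REQUIRED_PRIMARY <= present:
        return "none"
    if OPTIONAL_PRIMARY <= present:
        return "full"
    return "partial"


# The same rule applies to row-based and column-based storage,
# because it only looks at attribute names.
row_based = [{"test_time_s": 0.0, "voltage_v": 3.7, "current_a": 0.5}]
column_based = {
    "test_time_s": [0.0],
    "voltage_v": [3.7],
    "current_a": [0.5],
    "temperature_c": [25.0],
    "cycle_index": [1],
}

# Meta data lives next to the primary data, not inside it.
meta_data = {"location": "lab-1", "staff_id": "s-042", "batch_number": "b-7"}

print(conformance(row_based[0].keys()))   # partial
print(conformance(column_based.keys()))   # full
```

Because the check only depends on names, the same rule could be written in any language and applied to files or to a database schema.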
Our team would be happy to support this from an ontology point of view! As @smedegaard correctly points out, BattInfo is focused on domain terms like materials / battery terms / quantities / etc. For more general metadata terms, there are general semantic vocabularies like Dublin Core, DCAT, PROV-O, CSVW, etc. that can be combined with BattInfo.
I like the 5 Stars of Linked Data conceptual model for representing the key characteristics that linked data / metadata should have.
@valentin, I think we should definitely consider the Voltaiq Data Format in the design, but if we aim for community acceptance then we should also avoid being associated too strongly with one brand or another.
@gabe, these folks from Keysight are supportive of the vision for FAIR testing data and might be willing to support / adopt an open standard. They will also join the advisory board for DigiBatt.
The Voltaiq format is interesting. I would consider finding a way of decoupling the primary data and meta data so the two can evolve independently.
It also seems like a lot of overhead to have up to 1024 key:value pairs of meta data in CSV headers. Finally, the CSV headers make it cumbersome to convert the data to column-based binary files like the Parquet format, which is ubiquitous in machine learning and analysis of large datasets.
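As a rough illustration of the decoupling, the sketch below splits a file with `key: value` header lines into a meta data dictionary and a column-oriented table. The layout (a `#` prefix for header lines, the column names, the sample values) is a made-up stand-in and not the actual Voltaiq Data Format.

```python
import csv
import io


def split_metadata_and_rows(text, prefix="#"):
    """Separate 'key: value' header lines from the tabular body.

    Assumption: header lines start with `prefix`; everything else is CSV.
    """
    metadata = {}
    body_lines = []
    for line in text.splitlines():
        if line.startswith(prefix):
            key, _, value = line[len(prefix):].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            body_lines.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(body_lines))))
    return metadata, rows


def rows_to_columns(rows):
    """Pivot row dictionaries into one list of floats per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(float(value))
    return columns


# Hypothetical sample file content.
SAMPLE = """\
# cell_id: cell-001
# batch_number: b-7
test_time_s,voltage_v,current_a
0.0,3.70,0.50
1.0,3.71,0.50
"""

metadata, rows = split_metadata_and_rows(SAMPLE)
columns = rows_to_columns(rows)
print(metadata)
print(columns)
```

Once the two parts are separate, the column dictionary could be handed to a Parquet writer and the meta data stored wherever suits it best, so each can evolve on its own schedule.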
pycti and pymacnet are designed to be just “dumb” wrappers around the protocols provided by the equipment manufacturers (CTI and MacNet). As such they just read the data from the equipment and pass it on. It is the responsibility of the downstream software to perform any data transformation needed before the data is stored in a database or consumed elsewhere.
We have since implemented this scheme for other equipment and find it to be scalable. This is not to say that pycti or pymacnet can’t be extended to include additional responsibilities.
If you can give an example of the data output feature you would like to see included, we can recommend whether it would be a natural fit for the driver layer or for a downstream layer.
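The layering described above might look roughly like the sketch below. The reader function is a stand-in that yields canned values; it is not the pycti or pymacnet API, and the raw field names and target attribute names are hypothetical.

```python
from typing import Callable, Dict, Iterator, List

Reading = Dict[str, float]


def fake_driver_read() -> Iterator[Reading]:
    """Stand-in for a thin driver: returns vendor-shaped data untouched.

    This is NOT the pycti or pymacnet API, only an illustration.
    """
    yield {"V": 3.70, "I_mA": 500.0, "t_ms": 0.0}
    yield {"V": 3.71, "I_mA": 500.0, "t_ms": 1000.0}


def to_standard(raw: Reading) -> Reading:
    """Downstream transform: rename fields and convert to base units."""
    return {
        "test_time_s": raw["t_ms"] / 1000.0,
        "voltage_v": raw["V"],
        "current_a": raw["I_mA"] / 1000.0,
    }


def pipeline(
    read: Callable[[], Iterator[Reading]],
    transform: Callable[[Reading], Reading],
) -> List[Reading]:
    """Keep the driver and the transform independent of each other."""
    return [transform(raw) for raw in read()]


records = pipeline(fake_driver_read, to_standard)
print(records)
```

With this split, supporting a standard output format would mean swapping or adding a transform, while the driver layer stays unchanged.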
@smedegaard I tried reading about FAIR but it isn’t readily obvious what it means. If you already understand it, would you be able to create a one-pager or a summary that is like a FAIR ELI5? I think we address metadata recording elsewhere, but I want to understand what else might be needed from the data format side.
I’m not an expert on the FAIR principles myself. I’ve only read a little about them.
Perplexity explains it like this:
The FAIR Principle in the context of scientific data stands for Findable, Accessible, Interoperable, and Reusable. These principles aim to enhance the management and stewardship of digital assets, ensuring they are easily discoverable, available for access, capable of integration with other data, and optimized for reuse. To elaborate:
Findable: Data and metadata should be assigned unique identifiers and described with rich metadata to enable easy discovery by both humans and machines.
Accessible: Data should be retrievable using standard protocols, with open and universal access, including authentication procedures when necessary.
Interoperable: Data should use a common language for knowledge representation, follow FAIR vocabularies, and include references to other data to facilitate integration with applications and workflows.
Reusable: Data should be well-described with relevant attributes, associated with clear usage licenses and provenance, and meet community standards to enable replication and combination in various settings.
These principles aim to promote transparency, collaboration, and the maximization of data’s value in scientific research and beyond, ultimately supporting knowledge discovery, innovation, and rigorous scholarship across different domains.
My interpretation of the FAIR Principles is that they are not specific to data formats but are rather attributes that describe the quality of a dataset. This checklist, produced for the EUDAT summer school to discuss how FAIR the participants’ data were, gives us a rubric to help us evaluate.
Findable is more relevant to the data being searchable in a datastore or repository with a persistent identifier. Would establishing a DOI for each dataset make a resource sufficiently findable?
Accessible can be satisfied if it is made available via a direct web link.
Interoperable is the most relevant for this discussion of standard cycler output data. If we can define some means to put cycler data into a common / open format, applying controlled vocabulary, keywords or ontologies where possible, that seems potentially useful.
Reusable relates to the licensing and documentation. If we’re going to put the data anywhere, we would need to make sure licensing is clearly established.
For battery data, if a dataset had a DOI and was made available in a datastore like Battery Archive or GitHub with a newly defined interoperable format and clear licensing and documentation, I think that would satisfy the FAIR requirements.
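As a rough sketch, the interpretation above can be turned into a simple gap check on a dataset record. The field names are hypothetical, the DOI and URL are placeholders, and this is only a reading of the points in this post, not an official FAIR assessment.

```python
def fair_gaps(record):
    """Return the FAIR aspects for which the record lacks evidence.

    Assumption: each aspect is reduced to the presence of one or two
    fields, following the interpretation in this thread.
    """
    checks = {
        "findable": bool(record.get("doi")),
        "accessible": bool(record.get("access_url")),
        "interoperable": bool(record.get("format"))
        and bool(record.get("vocabulary")),
        "reusable": bool(record.get("license"))
        and bool(record.get("documentation")),
    }
    return [aspect for aspect, ok in checks.items() if not ok]


# Placeholder values for illustration only.
complete = {
    "doi": "10.0000/example",
    "access_url": "https://example.org/dataset",
    "format": "open-cycler-format",
    "vocabulary": "BattInfo",
    "license": "CC-BY-4.0",
    "documentation": "README.md",
}
doi_only = {"doi": "10.0000/example"}

print(fair_gaps(complete))
print(fair_gaps(doi_only))
```

A check like this says nothing about the quality of the metadata itself, which is where a rubric such as the checklist mentioned above would still be needed.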
@valentin
What is the license associated with the Voltaiq Data format? I didn’t see anything jump out at me in the GitHub. Do they permit/welcome the community to make changes as needed?
I don’t have the link to the Voltaiq format at hand. But I think it’s important that a format proposed by a subgroup of the Linux Foundation does not lean too heavily towards a commercial actor in the space.
Or maybe that’s not a concern?
My suggestion is just that we use the VDF as a starting point, give credit to Voltaiq, call it something else, then evolve it with our own governance (not making changes / PRs to the Voltaiq repo).
My only concern is that Voltaiq published this repo without a license (as far as I can see). Depending on the license they choose to publish it under, they may not allow the creation of derivative work. If they publish it with a permissive license, we can go that route.
However, if they choose not to publish with an appropriately permissive license, I’d recommend starting from another format which has a permissive license, such as Battery Archive, beep or another industry player.
@gabe they haven’t included any license but this is from the chapter “Voltaiq Data Format” of the white paper:
we open-source our guidelines and invite the community to evolve the guidelines into standards that can be adopted and accepted as best-practices across the community.
If we reached out to them, we could confirm that we would like to build on their work under the BDA.
@smedegaard I reached out to Voltaiq. They are eager to participate in BDA and will get back to me on the license for VDF. I am optimistic we will find some way to leverage the format, at least as a starting point for our development.
Given the latest, and assuming we will have the format under a usable license, what would be the next step on this thread?