Open standard for cycler output data

Great to hear @gabe :+1:

I have had some time to think about ways to proceed on the data structure.
My general thinking is to use a data format that supports nested data types to hold the metadata of a given battery test. That is the JSON part of VDF.
Depending on the implementation, the test data (the CSV part) would be inside a nested structure as well (depending on the db/file format, often called a struct).
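
To make the nested idea concrete, here is a minimal sketch using only the Python standard library. All class names, field names, and values are made up for illustration and are not taken from VDF or any other specification; the point is only that metadata and time series can live in one nested record.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Sample:
    """One row of cycler time-series data (illustrative field names)."""
    test_time_s: float
    current_a: float
    voltage_v: float


@dataclass
class BatteryTest:
    """Metadata plus time series for one test, held in a single nested record."""
    test_id: str
    cycler: str
    samples: list[Sample] = field(default_factory=list)


# Dummy values, not real measurements.
test = BatteryTest(
    test_id="example-001",
    cycler="hypothetical-cycler",
    samples=[Sample(0.0, 0.5, 3.70), Sample(1.0, 0.5, 3.71)],
)

record = asdict(test)         # nested dict: metadata keys plus a list of sample dicts
encoded = json.dumps(record)  # the whole test serializes to one JSON document
```

Formats with native struct and list types could store the same shape directly; JSON is used here only because it needs no extra dependencies.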

A good first reference implementation could be using the Delta Lake format.

Inviting the folks from Voltaiq to this thread is a good first step.

What are people’s thoughts on opinionated vs flexible column names in the file format?
Opinionated: you have to call the voltage “Voltage [V]” and its units have to be volts.
Flexible: projects have a config file that specifies these things: {"VOLTAGE_NAME": "Voltage", "VOLTAGE_UNITS": "V"}

My thoughts are that it depends on whether we care more about the standard or about the data management software. If we care about the standard, we should be opinionated. If we care about the data management software, we should be more flexible. Being flexible would allow us to read in data from a variety of other public sources without needing to convert and store it in our n+1th standard. But it might make things unnecessarily complicated in practice.
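
A rough sketch of how the flexible option could feed the opinionated one: a per-project config names the source columns and units, and a small function maps rows onto fixed headers. The config keys follow the example above; the `CURRENT_*` keys, the `CANONICAL` table, the `SCALE` factors, and the `normalize` function are assumptions made up for this illustration.

```python
# Hypothetical per-project config: names and units as they appear in the source files.
PROJECT_CONFIG = {
    "VOLTAGE_NAME": "Voltage",
    "VOLTAGE_UNITS": "V",
    "CURRENT_NAME": "I",
    "CURRENT_UNITS": "mA",
}

# Opinionated target: one fixed header and one fixed unit per quantity.
CANONICAL = {
    "VOLTAGE": ("Voltage [V]", "V"),
    "CURRENT": ("Current [A]", "A"),
}

# Scale factors from (source unit, target unit); only the pairs needed here.
SCALE = {("V", "V"): 1.0, ("A", "A"): 1.0, ("mA", "A"): 1e-3}


def normalize(rows, config):
    """Rename columns and convert units according to the project config."""
    normalized = []
    for row in rows:
        new_row = {}
        for quantity, (header, unit) in CANONICAL.items():
            source_header = config[f"{quantity}_NAME"]
            source_unit = config[f"{quantity}_UNITS"]
            new_row[header] = row[source_header] * SCALE[(source_unit, unit)]
        normalized.append(new_row)
    return normalized


rows = normalize([{"Voltage": 3.7, "I": 500.0}], PROJECT_CONFIG)
```

The trade-off is visible even at this size: every project has to get its config right, and a wrong unit entry silently produces wrong numbers.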

Good question @valentin,

As you say, it depends on what we are solving for.

In my mind it’s about managing and working with large amounts of test data created by cyclers and their software.
I don’t see how we can accomplish that without defining some structure for the data. It can have some flexibility in terms of evolving the schema, but my worry is that manually created configuration files may lead to the data being read incorrectly.

But I don’t have lab experience, so I might be missing some context on how the equipment and software are used in the wild.

Hi all.

I’ve given it a little thought and I think a usable solution could be a nested structure in any data format/database that supports it.

The big question for me is if it’s too unfamiliar for people to work with.

I’ve made a notebook that only defines a data schema for now. I will expand with some dummy data, and see how well it performs when reading and writing large amounts of data.

The data schema you defined, does your example leverage the VDF? And the battinfo battery ontology?

Naming of the fields is inspired by VDF. There is no ontology integration yet.

Adding to the discussion, BattGenie has already open-sourced a data ingestion Python package, BattETL, that works with Maccor and Arbin data, and a database to store data as well as metadata, BattDB.

BattETL: GitHub - BattGenie/battetl: A module for extracting, transforming, and loading battery cycler data to a database.
BattDB: GitHub - BattGenie/battdb

We will soon be releasing a FOSS version of both of the above under the banner of BDA, which will also include a visualization layer for the ingested data.

Let me know if there are any questions about the above.

Here is the video from our meeting discussing how to leverage Battery Ontology with BattETL & BattDB.

Thanks @DrSimonClark for the demo!

Continuing this thread, there’s since been quite a bit of progress, including work on the Battery Data Format.

Thanks @DrSimonClark for publishing the first cell in the BDF to the Battery Knowledge Base: Discharging Time Series of a CR2032 Battery at 11 mA

This really shows the power of the Battery Ontology and of aligning with best practices, and shows what these tools can do. Looking forward to seeing what this will enable.

We recently had a great conversation with folks from Faraday Institute who are working on a similar problem, and their solution is largely aligned. After some discussion, the next BDF proposal would be to align to 4 required fields (if we adopt cycle_dimensionless → step_count)

  • test_time_millisecond
  • step_count
  • current_ampere
  • voltage_volt

With the following open questions remaining:

  1. When to increment step? Is it aligned to the program step, so that it can repeat, or does it increase by 1 and never decrease? Or is it aligned to the software protocol?
  2. Time measured in seconds vs milliseconds
  3. Naming convention, how to include units in the name. The choice to use _unit was to ease machine readability and parsing for unit conversions

Interested to know what folks think of the above.
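
For illustration, here is a minimal sketch of what checking the four proposed fields could look like, and how the `_unit` suffix makes the unit easy to pull out of a column name. The helper names `missing_required` and `split_unit` are hypothetical and not part of any BDF tooling.

```python
# The four required fields proposed above.
REQUIRED_FIELDS = (
    "test_time_millisecond",
    "step_count",
    "current_ampere",
    "voltage_volt",
)


def missing_required(columns):
    """Return the required field names that are absent from a table's columns."""
    return [name for name in REQUIRED_FIELDS if name not in columns]


def split_unit(name):
    """Split a column name into (quantity, unit) at the last underscore."""
    quantity, _, unit = name.rpartition("_")
    return quantity, unit


missing = missing_required(["test_time_millisecond", "voltage_volt", "current_ampere"])
parts = [split_unit(name) for name in REQUIRED_FIELDS]
```

Note that this simple split treats `count` in `step_count` as the unit part, which is one reason the authoritative unit declaration is better kept in metadata than inferred from the name.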

  1. When to increment step? In my opinion, it should be monotonically increasing anytime there is a change in the control mode.

  2. Time measured in seconds vs milliseconds? Second is more intuitive and user-friendly than millisecond. I suspect millisecond was chosen because it is more likely to be an integer. But I don’t have a strong opinion about it.

  3. Naming convention for units? The requirements that I see are that they should (i) be easily parseable, (ii) be explicitly clear, and (iii) align to an existing standard. Plain symbols for units open up case-sensitivity issues (e.g. s and S), so the actual name is more explicit and robust (e.g. second and siemens). Pint is a good package for parsing unit names, doing unit conversions, and mapping to unit ontology terms. That’s why this style was recommended.
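
As a sketch of the first point, a step counter that increases by one at every change of control mode could look like the following. The function name and the mode labels ("CC", "CV", "rest") are assumptions for illustration; real cycler exports encode the control mode in different ways.

```python
def step_count_from_modes(modes):
    """Return a step count that increases by 1 whenever the control mode changes."""
    counts = []
    step = 0
    previous = object()  # sentinel that never equals a real mode label
    for mode in modes:
        if mode != previous:
            step += 1
            previous = mode
        counts.append(step)
    return counts


modes = ["CC", "CC", "CV", "CV", "rest", "CC"]
steps = step_count_from_modes(modes)
# steps == [1, 1, 2, 2, 3, 4]
```

With this definition the count never decreases, so a repeated program step (the second "CC" segment above) gets a new number rather than reusing the old one.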

millisecond vs. second, etc.
In general I would just advise not truncating the values early on. Numbers can always be formatted to be human readable, but the decimal places may carry information for machine learning and analysis.

About units: a more robust way than putting the unit in the field name would be adding a data structure just for units, like in the notebook I shared earlier.
In Python that could look something like

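from dataclasses import dataclass
from enum import Enum


@dataclass
class SIUnitDataMixin:
    sign: str       # unit symbol, e.g. "A"
    unit: str       # full unit name, e.g. "ampere"
    dimension: str  # physical dimension, e.g. "electric current"


class SIUnit(SIUnitDataMixin, Enum):
    AMPERE = ("A", "ampere", "electric current")
    AMPERE_HOUR = ("Ah", "ampere-hour", "electric capacity")
    CELSIUS = ("°C", "degree Celsius", "temperature")
    VOLT = ("V", "volt", "electric potential")
    MILLISECOND = ("ms", "millisecond", "time")

    def to_dict(self):
        # Plain dict representation, e.g. for embedding in JSON
        return {
            "sign": self.sign,
            "name": self.unit,
            "dimension": self.dimension
        }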

Example

SIUnit.AMPERE_HOUR.to_dict()
{'sign': 'Ah', 'name': 'ampere-hour', 'dimension': 'electric capacity'}

Depending on the database or type of files one uses for storing the data a measurement could either

point to a table of SI units in a relational database:

or have the unit embedded a la

{
...,
"measurements" : [
  {
    "value": 42.42,
    "unit": {"sign": "A", "unit": "ampere", "dimension": "electric current"}
  },
  {
    "value": 3.14,
    "unit": {"sign": "A", "unit": "ampere", "dimension": "electric current"}
  },
],
...
}

wrt units: putting it in the field name is mostly just a helpful flag for the humans. The real work is handled in the metadata.

What you describe @smedegaard is the functionality that we want, but we’d rather build on existing infrastructure than create something new. Unit definitions, conversions, and quantity validation have been handled extensively in resources like pint, qudt, and emmo.

The metadata for the BDF table schema handles the declaration of units for the machines. But if a human wants to parse the key to work with the unit in pint, they can do that.

Thank you for the invitation to contribute to this discussion @gabe.

On the format of units:

I think my perspective (from developing the user-interface layer to a battery data format with PyProBE) is aligned with a point I see Valentin made in this thread before. I appreciate the motivation for avoiding ambiguity that a verbose spelling of the unit achieves, but I see a couple of drawbacks:

  1. Ease of use and user acceptance: I would suggest it is OK for a standard to be opinionated, and the SI symbol standard is well established, so distinguishing between lower and upper case S for seconds and siemens seems like a relatively minor problem. In contrast, spelling out the unit can get convoluted, especially with combination units. An extreme example of this would be heat flux (watts-per-metre-squared?). While compound units like these are relatively rare in time-series battery data, they are extremely common in the parameters of battery models, which a researcher may well be computing from the data stored in the BDF format. It would be nice, I think, for the human-readable string form of these quantities to remain consistent when passed from raw data, to processing tools and finally to models.
  2. A more concise unit format can also be plotted directly from the data source, for example with pandas df.plot("Time [s]", "Voltage [V]"), while a more verbose form would require manually overriding the axis labels. While this would be very simple, it may be off-putting to a user. It also would mean that the standard would be obscured when the data is presented in published work.
  3. Finally, the unit names are not without ambiguity: spelling varies between languages, and even between British and American English (metre vs meter!).

I 100% agree with using existing tools like pint for user interaction with units (my current implementation of units in PyProBE works for common battery units, but is very basic and not scalable, so I have it on my to-do list to integrate pint), but as the column names are user-facing, conciseness and simplicity are important.

I can see that there are a few different ways that units are described in the table schema on top of the column name. I assume these each serve a separate purpose, like linking to other standards (emmo for instance)? Are there parallels between this metadata field and something like “known aliases”, where the different column names and unit styles of battery cyclers could be recorded, e.g. “biologic:Ecell/V”? Sorry if this is the wrong end of the stick; this may be better suited to a database that links the standard to the multitude of other formats that exist. My implementation is a set of classes that can convert column names and compute derived columns (capacity from charge and discharge capacity, for example), but I see maintaining the various cycler format parsers as a significant ongoing maintenance challenge, so I was curious to hear your thoughts.

On the definition of step:

I think it is most useful if it is aligned with the program step, for identifying sections of an experiment. It then naturally also includes cycle information, which can be identified from when the step number decreases, i.e. Step = [1,2,3,1,2,3], Cycle = [1,1,1,2,2,2]. It isn’t clear to me why you would ever remove this information, if your cycler has recorded it, in order to make this column monotonically increasing. I don’t think anything is lost by allowing step numbers to repeat: if the data source provides repeating step numbers then individual program instructions can be separated, and if not, the behaviour would fall back to a monotonically increasing count.
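
A minimal sketch of that inference, assuming a new cycle starts whenever the step number drops below the previous row's value. The function name `cycles_from_steps` is made up for this example and is not taken from PyProBE or any other package.

```python
def cycles_from_steps(steps):
    """Infer a cycle number per row: a new cycle starts when the step number decreases."""
    cycles = []
    cycle = 1
    previous = None
    for step in steps:
        if previous is not None and step < previous:
            cycle += 1
        cycles.append(cycle)
        previous = step
    return cycles


cycles = cycles_from_steps([1, 2, 3, 1, 2, 3])
# cycles == [1, 1, 1, 2, 2, 2]
```

If the source only provides a monotonically increasing step count, the same function returns a single cycle for every row, which matches the fallback behaviour described above.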

wrt units:

The column name is for human use and readability, and so ease of use and user acceptance should be prioritized. That comes down to a stylistic choice.

Using a formatted label, like “Voltage / V”, is acceptable, as long as the symbol for the unit can be parsed by existing tooling like pint to provide an extra layer of robustness.

  • This has the pro that it is nicely formatted for human-readability.
  • It has the cons that it includes capitalization, spaces, and special characters - which all present opportunities for mistakes.
  • It could also have formatting problems for units with superscripts like W/m².
  • In my opinion, “voltage_volt” is more human-friendly because it has fewer opportunities for mistakes, even though it is more verbose.

If we do go with formatted labels, then we should note that the SI and IUPAC best practice for quantity notation is to use a slash rather than parentheses or brackets (e.g. “Voltage / V”), because quantities follow the rules of algebra (covered by section 5.4.1 of the 2019 SI Brochure and section 1.1 of the IUPAC Green Book).
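
As a rough sketch of how such a label could be split before handing the unit symbol to a tool like pint, the following uses only the standard library. The regular expression and the `parse_label` name are assumptions for illustration; the split is made at the first slash so that compound units keep their own slashes.

```python
import re

# Quantity name, first slash, then the unit symbol (which may itself contain slashes).
LABEL_PATTERN = re.compile(r"^\s*(?P<quantity>[^/]+?)\s*/\s*(?P<unit>.+?)\s*$")


def parse_label(label):
    """Split a 'Quantity / unit' label into (quantity, unit symbol)."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a 'Quantity / unit' label: {label!r}")
    return match["quantity"], match["unit"]


voltage = parse_label("Voltage / V")
heat_flux = parse_label("Heat flux / W/m²")
```

This only separates the two parts; whether the resulting symbol is a valid unit still has to be checked by the unit tooling or against the metadata.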

The main thing is that the units are robustly handled in the metadata, which is covered. :white_check_mark:

Remember that we can define all kinds of labels in the metadata using rdfs:label or skos:altLabel or skos:hiddenLabel. Known aliases are compiled in the parent class terms from the electrochemistry domain ontology (e.g. current), which can be queried in the event that a table schema is not available.