BDF: What's Been Built and What's Next

This post is a living summary of where the Battery Data Format stands. I’ll keep it updated as things move. If you’re new here, the open standard thread has the community discussion history — but here’s the short version.

Background

BDF started with a specification donation from Ohm in May 2025. In parallel, the community had been discussing what an open standard for cycler output data should look like since early 2024 — evaluating existing formats, ontologies, and data management approaches. That discussion informed the direction as the donated spec was refined into what shipped.

Key contributions along the way: @DrSimonClark brought ontology alignment with BattINFO/EMMO, @smedegaard pushed for Parquet-compatible nested structures and kicked off the original spec discussion, and @TomHolland contributed the PyProBE integration.

Timeline

  • May 2025 — Ohm donates the initial BDF specification
  • Aug 2025 — Largest open source battery dataset contribution: 199 cells (NMC//graphite and LFP//graphite), each tested for 1,000 cycles under fully automated workflows
  • Dec 2025 — LF Energy publicly releases BDF
  • Jan 2026 — Partnership announcements: BattINFO Ontology, Faraday Institution (PyProBE and BDX), Microsoft Open Battery Dataset contribution, Ohm BDF Converter
  • Feb 2026 — BDF Python package released on PyPI
  • Mar 2026 — Dataset donation from Microsoft Surface Battery Development

What’s been decided

  • Canonical column schema: BDF standardizes column labels and units so datasets from different cyclers can be compared without custom glue code. Required columns are Test Time / s, Voltage / V, and Current / A, with recommended columns for cycle count, step count, temperature, etc.
  • Ontology alignment: column names and metadata terms map to BattINFO/EMMO, making BDF datasets machine-interpretable.
  • Units in column names: SI-style labels (e.g. Voltage / V) as the promoted convention, with tolerance for [] and () variants. Unit semantics live in metadata and are parseable by pint.
  • Time in seconds, not milliseconds. Seconds are more natural for test durations, and the precision loss from storing them as floats is negligible for serialized data.
  • Step index: monotonically increasing, incremented on any change in control mode.
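The “units in column names” convention lends itself to a simple parsing step. A minimal sketch (the helper name is mine, not part of the spec) that splits a label on the `/` separator, leaving a unit string that pint can then interpret:

```python
def parse_bdf_column(label: str) -> tuple[str, str]:
    """Split a BDF-style column label like 'Voltage / V' into
    (quantity name, unit string). The '/' separates quantity and
    unit, so the unit string can be handed to pint afterwards,
    e.g. pint.UnitRegistry().Unit(unit).
    """
    name, _, unit = label.partition("/")
    return name.strip(), unit.strip()

parse_bdf_column("Test Time / s")   # ('Test Time', 's')
```

As the spec notes, the string label is only a convenience; the authoritative unit semantics live in the metadata.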

What’s still open

These are the active spec questions. If you have experience here, the GitHub issues are the right place to weigh in.

Datastore

The bdf-datastore is a growing collection of real battery datasets in BDF format, organized by contributor and cell. It currently includes datasets from SINTEF and Microsoft, with contributions structured as raw vendor data alongside processed BDF files and metadata. The 199-cell dataset from European research labs (NMC//graphite and LFP//graphite, 1,000 cycles each) is the largest open source contribution to date, and Microsoft Surface Battery Development donated a dataset in March 2026.

Contributing data is straightforward — fork the repo, add your dataset following the folder conventions, and open a PR.

Tooling

  • pip install batterydf — reads vendor exports, normalizes to BDF, validates, and produces metadata. Supports Neware NDA imports and interactive plotting via Plotly and hvplot. PyPI · GitHub
  • @TomHolland’s PyProBE provides a user-interface layer compatible with BDF
  • @DrSimonClark published the first individual BDF dataset to Zenodo: CR2032 discharge time series

Last updated: April 2026

Thanks for the summary!

I’d like to share some pain points from field experience, although I suspect most of it has already been discussed.

  • Temperature sensors: indeed this is a huge pain point, as their interpretation can vary greatly from experiment to experiment. I think just enumerating the surface sensors (T1, T2, …) is the best solution. Alternatively, any name could be allowed, with temperature columns identified by the degree-Celsius unit and only the ambient temperature standardized.
  • I would definitely standardize the ambient temperature sensor name because it’s special. Sometimes, it’s useful to also include the climate chamber setting (rather than the actual ambient temperature) as a column. Standardizing that one might make sense, too.
  • Temperature units: I think I’ve seen every possible representation of °C: `°C`, `degC`, `C`, `gradC`, `�C` … I suggest being strict about whatever the standard defines as the required unit, and not trying to interpret variants.
    • Personally, I very much prefer `Cel` over `degC`, as defined in UCUM
  • Tolerance for variants of unit notation (`[]`, `()`): it would be great if this could be avoided. I suggest being strict, or at least forbidding `[]()` in column names (again, I’ve seen every possible combination)
  • Time columns
    • we always made the Unix timestamp a required column, because that’s the only reliable way to combine data from tests spread across multiple files, which happens often in practice. Even if you could hypothetically track a “global test time”, I would never rely on it in practice. I understand absolute time might not always be feasible (e.g. in publicly shared datasets), but I would recommend strongly encouraging it.
    • I don’t see why test time is required when absolute time is present
    • In databases, storing times (absolute or relative) as integers rather than floats has many benefits. (We used 64-bit nanosecond timestamps.) If you’ve already settled on floats, that’s OK.
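To put rough numbers on the integer-vs-float point (my arithmetic, not from the spec): float64 seconds still resolve sub-microsecond steps at present-day Unix timestamps, while int64 nanoseconds are exact over a span of roughly 292 years.

```python
import math

# A float64 has a 52-bit mantissa, so near a present-day Unix
# timestamp (~1.7e9 s) the gap between adjacent representable
# floats is about 2**-22 s:
gap = math.ulp(1.7e9)
print(f"float64 gap at ~1.7e9 s: {gap:.3e} s")  # roughly 2.4e-07 s

# Int64 nanoseconds (the pandas/Arrow convention) are exact, and
# the representable span from the epoch is about 292 years:
span_years = (2**63 - 1) / 1e9 / (365.25 * 86400)
print(f"int64 ns span: ~{span_years:.0f} years")
```

So the choice is less about precision loss and more about exactness guarantees and database ergonomics, as the comment above suggests.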

Thanks for your comments and engagement, David :raising_hands: :battery:

Some thoughts on the points you raised:

  • Temperature sensors: yes, this T1, T2, … approach is what we landed on. It gets the job done, but the next thing we need to think about is how to express the location of the surface sensor (e.g. in coordinates, keywords, or something else…)

  • I’m on the fence about distinguishing between measured and set ambient temperature. On the one hand, it’s certainly helpful to have both values. On the other hand, the measured temperature is the one that matters. But it’s probably good to support both.

  • The semantics of the units are actually encoded in the application ontology term, and that buys you access to QUDT, EMMO, UCUM, pint, etc. for doing things like conversions, dimensional analysis, etc. The string description of the unit is really only a back-up that can be parsed with pint if all else fails.

  • I agree we should strictly forbid brackets and parentheses in units. Many people forget that a quantity is a mathematical entity, the product of a value and a unit, so the use of / is not a style choice but carries meaning. Currently we support a human label Voltage / V and a machine label voltage_volt … we may consider including aliases with skos:hiddenLabel, but I think only as a backend robustness resource.

  • I agree that Unix timestamp should be the required column and Test Time should then be optional; that should be the strong recommendation going forward. The only reason we haven’t made it “official” yet is that most legacy datasets don’t include a date-time, but I agree with your reasoning. Regarding integers vs. floats: we originally supported time in ms to make it an integer, but we got feedback that it was confusing to work with (we don’t prefix any other units), so we moved to seconds.
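The human/machine label pairing mentioned above (Voltage / V ↔ voltage_volt) could be derived mechanically. A sketch, where the helper and its small unit-name table are illustrative only — in BDF the real mapping comes from the ontology terms, not string munging:

```python
# Illustrative subset of unit-symbol -> unit-name expansions; not
# the spec's actual mapping, which is defined by the ontology.
UNIT_NAMES = {"V": "volt", "A": "ampere", "s": "second", "degC": "degree_celsius"}

def machine_label(human_label: str) -> str:
    """Turn a human label like 'Voltage / V' into a machine label
    like 'voltage_volt' (hypothetical helper, not the BDF API)."""
    name, _, unit = human_label.partition("/")
    unit = unit.strip()
    words = name.strip().lower().split()
    return "_".join(words + [UNIT_NAMES.get(unit, unit.lower())])

machine_label("Voltage / V")   # 'voltage_volt'
```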

Thanks again for your post. You should come to the monthly meetings to get more involved! :+1:
