What is the deal with Dataset-JSON?

Unpacking the Complexities of Clinical Data Submission Standards

There is a storm brewing in the world of CDM (Clinical Data Management) and regulatory submissions. Since I came across the news that the FDA are considering Dataset-JSON as a new standard for submissions, I have been researching the topic to distil the wealth of information at our fingertips and understand what the potential impact might be.

Summary of Research

The most important lessons I learned during my research are:

  1. Dataset-JSON is part of the CDISC ODM v2.0 standard
  2. SAS can already import JSON with the LIBNAME JSON engine
  3. FDA will conduct further testing before changing regulations

Current Submission Formats

Since the 1999 guidance for Providing Regulatory Submissions in Electronic Format, the FDA has mandated that study data be submitted in SAS XPORT version 5 files.

Furthermore, according to the guidance Providing Regulatory Submissions In Electronic Format – Standardized Study Data and the more detailed requirements in the Study Data Technical Conformance Guide, the following exchange format standards are supported by the FDA:

  • Extensible Mark-up Language (XML)
  • Portable Document Format (PDF)
  • File Transport Format (XPORT)

XML is widely accepted as the de facto format for describing metadata about datasets for analysis and reporting to regulatory agencies and comes in the form of a Define-XML file. The Define-XML file is a critical component of clinical trial data packages such as SDTM (data from raw sources) and ADaM (data for analysis) datasets as defined by CDISC.

PDF is also accepted as a standard for providing documents in electronic format to the International Conference on Harmonisation (ICH). However, the problem with PDF is that data cannot easily be extracted and analysed. On the plus side, the format is useful for maintaining an archive of study data and can be stored in a hierarchical fashion to replicate the organisation of a clinical trial.

XPORT (XPT) is an open file format published by SAS Institute for the exchange of study data. XPT version 5 has been in use since 1989 and version 8 was introduced in 2012. It would appear that the FDA have not accepted v8 XPT files because there is not a compelling enough reason to do so. The following table summarises the few differences between the two versions, which suggests why the new version has not been widely adopted.

SAS Property               | Version 5           | Version 8
Variable Names             | Up to 8 characters  | Up to 32 characters
Character Variable Lengths | Up to 200 bytes     | Up to 32,767 bytes
Variable Labels            | Up to 40 characters | Up to 256 characters

Comparison of SAS XPT Versions
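
To make the table concrete, here is a minimal Python sketch that checks one variable's metadata against the limits listed above. The names XptLimits and check_variable are my own and purely illustrative; they are not part of any SAS or CDISC tool, and the check only covers the three properties in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XptLimits:
    """Limits for one XPT version, copied from the comparison table."""
    max_name_chars: int
    max_char_bytes: int
    max_label_chars: int


# Values taken from the table; nothing else about the formats is modelled.
XPT_V5 = XptLimits(max_name_chars=8, max_char_bytes=200, max_label_chars=40)
XPT_V8 = XptLimits(max_name_chars=32, max_char_bytes=32767, max_label_chars=256)


def check_variable(name, label, byte_length, limits):
    """Return a list of problems; an empty list means the variable fits."""
    problems = []
    if len(name) > limits.max_name_chars:
        problems.append(f"name '{name}' is longer than {limits.max_name_chars} characters")
    if len(label) > limits.max_label_chars:
        problems.append(f"label is longer than {limits.max_label_chars} characters")
    if byte_length > limits.max_char_bytes:
        problems.append(f"length {byte_length} exceeds {limits.max_char_bytes} bytes")
    return problems


# A 15-character variable name fails under version 5 but fits under version 8.
v5_problems = check_variable("AESTDTC_IMPUTED", "Start Date", 19, XPT_V5)
v8_problems = check_variable("AESTDTC_IMPUTED", "Start Date", 19, XPT_V8)
print(v5_problems)
print(v8_problems)
```

A check like this only flags where a dataset would need renaming or truncation to fit version 5; it says nothing about whether a file is otherwise valid.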

Don’t forget the regulations in FDA 21 CFR Part 11, which apply to those who maintain records or submit information to the FDA. They require all datasets submitted in electronic format to provide an accurate and complete copy of the data that is suitable for inspection, review, and copying.

Answering Some Questions

What are the Limitations of SAS XPT v5?

In 2017 PHUSE released a whitepaper discussing suggestions for replacing the SAS version 5 transport format. The document listed the following limitations that would need to be overcome by any new submission standard adopted by the FDA.

  • Limited variable types means some data is incorrectly interpreted.
  • Limited to US ASCII encoding which prevents use of double-byte characters used in many parts of the world.
  • Field names, variable names, variable labels, and character widths are restricted which reduces how much information can be conveyed.
  • Inefficient use of storage space because each column is stored at a fixed width, so space is reserved even when values are short or empty.
  • Datasets cannot be compressed, which causes file-handling problems, especially when the maximum file size is 5 gigabytes.
  • Lacks a robust metadata layer which means external files like Define-XML must be kept synchronised.
  • Only works for 2-dimensional data structures which means more complex structures such as nested objects, graphs, trees, and linked lists are not possible.
  • The format cannot be extended which means it is less compatible with modern technology.
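
Two of these limitations are easy to illustrate. The Python sketch below is a simplified illustration under my own assumptions: it does not read or write real XPT files, the function names are hypothetical, and the storage comparison simply assumes every value is padded to the declared column width.

```python
def is_us_ascii(text):
    """True when every character can be stored as US ASCII."""
    return text.isascii()


def fixed_width_bytes(values, width):
    """Bytes used if every value is padded to the same declared width."""
    return len(values) * width


def actual_bytes(values):
    """Bytes the values themselves occupy when encoded as UTF-8."""
    return sum(len(value.encode("utf-8")) for value in values)


# Encoding: a site name written in Japanese cannot be stored as US ASCII.
print(is_us_ascii("Tokyo"))  # True
print(is_us_ascii("東京"))    # False

# Storage: three short values in a column declared as 200 bytes wide.
terms = ["HEADACHE", "", "NAUSEA"]
padded = fixed_width_bytes(terms, 200)
used = actual_bytes(terms)
print(padded, used)  # 600 14
```

The gap between the two numbers is the padding a fixed-width layout carries for that one column, which grows with every record in the dataset.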

Who is JSON?!

JSON (JavaScript Object Notation), pronounced "jason", was derived from the object syntax of the JavaScript programming language, but it is a language-independent format that most programming languages can read and write. JSON is a lightweight, human-readable format for storing and transporting data, which makes it easy to understand.
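
As a quick illustration of how readable JSON is, the sketch below uses Python's standard json module to write a small nested record as text and read it back. The record and its field names are invented for this example and do not come from any CDISC standard.

```python
import json

# An invented subject record with a nested list of visits.
record = {
    "subject": "001",
    "visits": [
        {"name": "Screening", "weight_kg": 72.5},
        {"name": "Week 4", "weight_kg": None},  # missing value
    ],
}

# Serialise to indented text that a person can read directly.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into Python objects; the nesting survives the round trip.
parsed = json.loads(text)
print(parsed["visits"][0]["name"])
```

Note that the missing weight is written as null in the text and comes back as None, and that the nested list of visits needs no flattening, which is the kind of structure a 2-dimensional format cannot hold.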

What is Dataset-JSON?

A dataset is the term used to describe a collection of structured data in a single file. According to CDISC, Dataset-JSON seeks to address the limitations of SAS V5 XPORT.
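
To give a feel for the idea, here is a rough Python sketch of a tabular dataset carried as JSON, with column metadata and data rows in the same file. The key names (name, label, columns, dataType, rows) are my own simplification for illustration and should not be read as the official schema; the CDISC specification and the example files linked under Further Reading are the authority on the real structure.

```python
import json

# A simplified, hypothetical dataset: metadata and data travel together.
DATASET_TEXT = """
{
  "name": "DM",
  "label": "Demographics",
  "columns": [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [
    ["STUDY1-001", 34],
    ["STUDY1-002", null]
  ]
}
"""


def rows_as_dicts(dataset):
    """Pair each row's values with the column names from the metadata."""
    names = [column["name"] for column in dataset["columns"]]
    return [dict(zip(names, row)) for row in dataset["rows"]]


dataset = json.loads(DATASET_TEXT)
records = rows_as_dicts(dataset)
for rec in records:
    print(rec)
```

Keeping the column descriptions next to the rows is what lets a reader interpret the values without guessing types or labels, which speaks to the weak metadata layer noted among the XPT limitations.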

What about Dataset-XML and Define-XML?

Dataset-XML was created by CDISC in 2014 as a replacement for SAS XPT V5 and removed many of the V5 transport file restrictions.

Dataset-XML and Define-XML are different but complementary standards; Define-XML metadata describes the Dataset-XML dataset content. Dataset-XML can represent any tabular dataset and supports all language encodings supported by XML.

What about ODM?

ODM (Operational Data Model) is another CDISC standard. It is used for exchanging and archiving clinical and translational research data, and has been implemented by many electronic data capture (EDC) tools. The latest version, 2.0, was released in 2023 and includes the option of transporting data in XML or JSON format. To be clear, ODM is not intended for submission to the FDA.

Will we still need to use SAS?

In short, yes.

SAS will continue to be used for statistical analysis before submitting data to the FDA.

It is expected that clinical technology companies will provide XPT files alongside JSON during the transition period between JSON being accepted and XPT files being retired as the required format for submissions. The good news is that SAS can already import JSON for review using the LIBNAME JSON engine. Once the data has been reviewed satisfactorily in SAS, it is anticipated that the same data will be submitted to the FDA.

Fitting it all Together

As a clinical data manager, you might be asking, "How does this impact me?" The following points are a good place to start.

  • First, you need to be aware of these potential changes coming down the line. I recommend monitoring the FDA Data Standards page.
  • Second, the FDA will provide plenty of notice before mandating a new standard. The Study Data for Submission to CDER and CBER page of the FDA website says at least one year’s notice will be given before a new version of a standard is required.
  • Third, Dataset-XML was not accepted, so it is possible that Dataset-JSON will not be the next standard adopted by the FDA for submissions.

As of August 2024, the FDA stated they will conduct further testing to evaluate Dataset-JSON’s capability to support the submission of regulatory study data. I, for one, will be closely monitoring this project to see what the outcome is!

Further Reading

Whitepaper describing working with Dataset-JSON using SAS – submitted to PharmaSUG (May-2022)
https://www.pharmasug.org/proceedings/2022/AD/PharmaSUG-2022-AD-150.pdf

Example JSON datasets for ODM V2 in GitHub (Aug-2023)
https://github.com/cdisc-org/DataExchange-DatasetJson

Post about Draft Dataset-JSON API on LinkedIn (Mar-2024)
https://www.linkedin.com/pulse/dataset-json-api-sam-hume-jdeye/

Slides about Dataset-JSON as an alternative transport format for regulatory submissions pilot (Apr-2024)
https://www.cdisc.org/sites/default/files/2024-04/2024-EU-Interchange-Session-6B-SamHume-final.pdf

CORE (Conformance Rules) is free, open-source software for testing study data for conformance to CDISC standards as well as to regulatory and sponsor-specific conformance rule sets (Jul-2024)
https://www.cdisc.org/core

CDISC Public review webinar on Dataset-JSON v1.1 (Sep-2024)
https://www.cdisc.org/events/webinar/dataset-json-v1-1-public-review-webinar
