There are a number of descriptive elements that help identify the sources of data and are well worth including.
We believe all data should recognize its source, including its copyright, its licensee and the application that was used to generate the file.
All fields are optional but including them helps to provide better context.
<source>
<copyright>ABC Co</copyright>
<licensee>XYX Co</licensee>
<application>DataEngine 1.0</application>
</source>
"source": {
"copyright": "ABC Co",
"licensee": "XYX Co",
"application": "DataEngine 1.0"
}
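Because every field is optional, a consumer should read this block defensively. The following is a minimal Python sketch, assuming the JSON form shown above; the `Source` class and the `parse_source` function are illustrative names and are not part of the schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    # Every field is optional, so each one defaults to None when absent.
    copyright: Optional[str] = None
    licensee: Optional[str] = None
    application: Optional[str] = None


def parse_source(document: dict) -> Source:
    """Read the optional "source" block from a decoded JSON document."""
    block = document.get("source") or {}
    return Source(
        copyright=block.get("copyright"),
        licensee=block.get("licensee"),
        application=block.get("application"),
    )


doc = json.loads('{"source": {"copyright": "ABC Co", "application": "DataEngine 1.0"}}')
source = parse_source(doc)
print(source.licensee)  # None, because the licensee was not supplied
```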
We will consider adding additional standardized items to this section to represent scenarios where data has been combined from multiple sources.
Secondly, we want to include an audit trail of which options were used to create the file. We've processed many files in the past that were subtly altered because the provider changed the criteria used to generate the data.
Most of the options available here are outside the scope of the schema definition as they will depend on the tool generating the data file. The most important aspect is to be consistent in using the same fields for the same purpose in all files that you generate.
<criteria>
<creationDate>2019-04-01 23:20:19Z</creationDate>
<channels>1,2,3,4,5,6,7,8,9,10</channels>
<dateRange from="2019-01-01 00:00:00" to="2019-01-31 23:59:59" />
<regions>UK,US</regions>
</criteria>
"criteria": {
"creationDate": "2019-04-01 23:20:19Z",
"channels": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],
"dateRange": {
"from": "2019-01-01 00:00:00",
"to": "2019-01-31 23:59:59"
},
"regions": [ "UK", "US" ]
}
In the examples above, the JSON form uses arrays to represent lists whereas the XML uses a simple comma-separated value. It is up to the provider to decide which is most appropriate for their own dataset.
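A consumer that has to accept both forms can normalise them on the way in. The sketch below is one possible approach in Python, assuming that XML lists are comma-separated; the helper name `as_list` is hypothetical, and the sketch converts every item to a string so that the two forms compare equal.

```python
import json
import xml.etree.ElementTree as ET


def as_list(value):
    """Return a list of strings from either an array or a comma-separated string."""
    if value is None:
        return []
    if isinstance(value, str):
        return [item.strip() for item in value.split(",") if item.strip()]
    return [str(item) for item in value]


xml_criteria = ET.fromstring(
    "<criteria><channels>1,2,3</channels><regions>UK,US</regions></criteria>"
)
json_criteria = json.loads('{"channels": [1, 2, 3], "regions": ["UK", "US"]}')

xml_channels = as_list(xml_criteria.findtext("channels"))
json_channels = as_list(json_criteria["channels"])
print(xml_channels == json_channels)  # True
```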
When transmitting channel identifiers it is important to remember that channels regularly get renamed, so a balance must be struck between transmitting the name and/or an identifier that is unique. Introducing an identifier creates additional work to achieve the mapping but ensures that reports over time can consistently refer to the same channel.
It should also be noted that the term channel doesn't always refer to a specific channel that can be watched. There are virtual channels such as "Total TV", which refer to the sum of all channels, and virtual channels that combine figures to represent a total for channels that show a staggercast, e.g. "ITV1 Total" might be defined as "ITV1" + "ITV1 HD" + "ITV1 +1".
Many countries transmit different programmes across their broadcast regions. Sometimes the data is only treated in this regional form, but it is often aggregated into a national total. These national totals are often only represented as minute/day-part data, but sometimes individual programmes are shown simultaneously across all or most of the regions and therefore the national viewing figures are useful. A good example is a national broadcaster that has regional news slots. The bulk of the programmes are transmitted to all regions and therefore have useful national figures, but the individual news broadcasts will have individual regional figures.
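As an illustration of both points, the Python sketch below keys the figures by a stable identifier and defines a virtual channel as a plain sum of its components. All identifiers, audience figures and the `virtual_audience` helper are hypothetical, and a real provider may apply more involved rules than simple addition.

```python
# Hypothetical stable identifiers mapped to the current display names.
channel_names = {101: "ITV1", 102: "ITV1 HD", 103: "ITV1 +1"}

# Hypothetical audience figures (thousands) for one minute, keyed by identifier.
audiences = {101: 1200.0, 102: 350.0, 103: 90.0}

# A virtual channel expressed as the list of identifiers it sums over.
virtual_channels = {"ITV1 Total": [101, 102, 103]}


def virtual_audience(name, audiences, definitions):
    """Sum the audiences of the component channels of a virtual channel."""
    return sum(audiences[channel_id] for channel_id in definitions[name])


total = virtual_audience("ITV1 Total", audiences, virtual_channels)
print(total)  # 1640.0
```

Renaming a channel then only requires a change to `channel_names`; the virtual channel definition and any historical figures keep referring to the same identifiers.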
Broadcast time seems like a trivial item but it turns out not to be. Firstly, it is common for the schedule not to run from midnight to midnight. It often starts at 06:00 and runs until 05:59 on the following day, or in some countries from 02:00 to 01:59, and in others from 03:00 to 02:59.
The broadcast date is usually kept consistent, so we end up using a clock that doesn't run from 00:00 to 23:59 but from 06:00 to 29:59 (or 02:00 to 25:59, 03:00 to 26:59 etc).
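The conversion between the extended broadcast clock and ordinary calendar time can be sketched in Python as follows. This is a simplified example that works on local wall-clock time only and ignores daylight saving changes; the function names and the default 06:00 day start are assumptions for illustration.

```python
from datetime import date, datetime, timedelta


def broadcast_to_calendar(broadcast_date: date, broadcast_time: str) -> datetime:
    """Convert a broadcast date and an extended-clock "HH:MM" time to a calendar datetime."""
    hours, minutes = (int(part) for part in broadcast_time.split(":"))
    midnight = datetime(broadcast_date.year, broadcast_date.month, broadcast_date.day)
    # Hours of 24 or more roll over into the next calendar day.
    return midnight + timedelta(hours=hours, minutes=minutes)


def calendar_to_broadcast(moment: datetime, day_start_hour: int = 6):
    """Convert a calendar datetime to a (broadcast date, extended-clock time) pair."""
    if moment.hour < day_start_hour:
        # Early-morning times belong to the previous broadcast day.
        previous = moment.date() - timedelta(days=1)
        return previous, f"{moment.hour + 24:02d}:{moment.minute:02d}"
    return moment.date(), f"{moment.hour:02d}:{moment.minute:02d}"


print(broadcast_to_calendar(date(2019, 1, 1), "25:30"))  # 2019-01-02 01:30:00
print(calendar_to_broadcast(datetime(2019, 1, 2, 1, 30)))  # (datetime.date(2019, 1, 1), '25:30')
```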
This use of a non-standard time representation causes issues for consumers of the data, so we recommend always including a full datetime representation (with timezone info) in addition to the local broadcast time.
An additional complexity applies to countries that observe daylight saving time. This is where the full datetime becomes very useful for clarity, and it can be used to check that the 30hr representation is correct (on the days when the clocks change, the 30hr day ends at 29:00 or at 31:00 rather than at 30:00).
The start and end times of programmes are usually at minute resolution, but this doesn't have to be the case: accurate feeds from the broadcaster in the form of transmission logs allow for second-level resolution and the appropriate calculations with the source data. For this reason it is often not possible to match exactly which individual minutes have been included in a transmission's audience, as the rules for which panel members' viewing to include are often non-trivial, and the inclusion/exclusion rules for intervals can obscure things further.
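One way to perform that check is to compare the full datetimes at which two consecutive broadcast days start. The Python sketch below assumes ISO 8601 strings with a UTC offset and a 06:00 day start; the function name is hypothetical, and the UK clock-change dates for 2019 are used purely as an example.

```python
from datetime import datetime


def broadcast_day_end_label(day_start: str, next_day_start: str) -> str:
    """Return the extended-clock time at which a broadcast day ends.

    Both arguments are ISO 8601 datetimes with a UTC offset, marking the
    start of this broadcast day and the start of the next one.
    """
    start = datetime.fromisoformat(day_start)
    end = datetime.fromisoformat(next_day_start)
    # Offset-aware datetimes subtract in absolute (elapsed) time.
    elapsed_minutes = int((end - start).total_seconds() // 60)
    total_minutes = start.hour * 60 + start.minute + elapsed_minutes
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


# An ordinary day, then the days on which UK clocks went forward and back in 2019.
print(broadcast_day_end_label("2019-01-01T06:00:00+00:00", "2019-01-02T06:00:00+00:00"))  # 30:00
print(broadcast_day_end_label("2019-03-30T06:00:00+00:00", "2019-03-31T06:00:00+01:00"))  # 29:00
print(broadcast_day_end_label("2019-10-26T06:00:00+01:00", "2019-10-27T06:00:00+00:00"))  # 31:00
```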
Unlike channels, the names of audience categories should never change. We nevertheless provide the option to supply an ID as well as a name, as this will abstract the consumer away from underlying system changes that might affect multi-year datasets.
It is common, however, to migrate datasets from one audience category to another over time, and this presents its own issues as the data is often not comparable.
Each country will have its own definition of the panel and of which members represent "all individuals". The main cutoff appears to be the age at which an individual is considered to be able to make a choice, e.g. 3 years or 4 years old.
Activity types vary over time and new activity types are being introduced to represent non-linear viewing (watching on-demand services). The primary activity type will usually be live viewing, but it is very common to also include VOSDAL (viewing on same day as live), as this then includes those who have DVR facilities allowing them to pause live TV and watch shortly afterwards.
Consolidated viewing is usually defined along with the number of days over which consolidation has happened, e.g. consolidated 7 is viewing on the day and the six following days via recording devices. With the advent of on-demand services that allow catch-up as part of their offering there is also now a standard of consolidated 28, which includes viewing up to 27 days after the broadcast.
The activity types that are reported are usually aggregated from much more granular reporting within the panel used to collect the data. These low-level activity types might also be able to differentiate between advertising and programme content.
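The day counting in these definitions is easy to get wrong by one, so a small Python sketch may help. It assumes that both dates are broadcast dates and that the window includes the day of broadcast itself; the function name is illustrative only.

```python
from datetime import date


def in_consolidation_window(broadcast_date: date, viewing_date: date, days: int) -> bool:
    """True if viewing falls on the broadcast day or on one of the (days - 1) following days."""
    offset = (viewing_date - broadcast_date).days
    return 0 <= offset <= days - 1


broadcast = date(2019, 1, 1)
print(in_consolidation_window(broadcast, date(2019, 1, 7), 7))    # True: six days after
print(in_consolidation_window(broadcast, date(2019, 1, 8), 7))    # False: seven days after
print(in_consolidation_window(broadcast, date(2019, 1, 28), 28))  # True: 27 days after
```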
Each country will have a variety of providers that broadcast TV to the nation. Some will be subscription based, some will be free-to-air. The platform is used to represent how the viewing has been consumed. This might be as simple as terrestrial / satellite or go into the detail of the provider.
The following measures can be expressed against a given minute, transmission or interval.
Audience is the total number of people in an audience category viewing a given minute, transmission or interval, often expressed in thousands or millions.
Share represents the percentage of total viewers that were watching a specific minute, transmission or interval. It is calculated by dividing the audience by the total TV for the given period.
Total Viewing Rating (TVR) is a percentage of the population viewing a given minute, transmission or interval. It is calculated by dividing an audience by the universe for the given period.
Universe is the number of people within an audience category in the country or region at a specific time, usually expressed in millions or thousands.
Total TV is the sum of all station viewing for a given audience category within a specific period of time.
HUT (households using television) is the percentage of households in the universe that are viewing. PUT (persons using television) represents total TV as a percentage of the universe.
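The relationships between these measures can be written out as a short Python sketch. The function names and the figures (in thousands) are hypothetical; the formulas follow the definitions given above.

```python
def share(audience: float, total_tv: float) -> float:
    """Audience as a percentage of everyone viewing TV in the period."""
    return 100.0 * audience / total_tv


def tvr(audience: float, universe: float) -> float:
    """Audience as a percentage of the whole population in the category."""
    return 100.0 * audience / universe


def put(total_tv: float, universe: float) -> float:
    """Total TV viewing as a percentage of the universe."""
    return 100.0 * total_tv / universe


# Hypothetical figures in thousands for one audience category and one minute.
audience, total_tv, universe = 1500.0, 6000.0, 50000.0

print(share(audience, total_tv))  # 25.0
print(tvr(audience, universe))    # 3.0
print(put(total_tv, universe))    # 12.0
```

A useful sanity check on supplied data is that the TVR should equal the share multiplied by the PUT, divided by 100.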
Min and Max represent the minimum and maximum values of aggregated data over time. For example, aggregating minute data into 15 minutes might result in an Audience of 190.4, a Min of 174.6 and a Max of 200.6.
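A minimal Python sketch of such an aggregation is shown below. It assumes that the aggregated Audience is the arithmetic mean of the minute values, which may not match the rule used by a particular provider; the function name and the sample figures are hypothetical.

```python
from statistics import mean


def aggregate(minute_audiences):
    """Summarise per-minute audiences into a single interval record."""
    return {
        "audience": mean(minute_audiences),  # assumed: average audience over the interval
        "min": min(minute_audiences),
        "max": max(minute_audiences),
    }


# Hypothetical per-minute audiences (thousands) for a short interval.
summary = aggregate([180.0, 190.0, 200.0])
print(summary["min"], summary["max"])  # 180.0 200.0
```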