Teletext EPG grabber manual

This is the manual page for release 1.1 of the tv_grab_ttx software package.


NAME

tv_grab_ttx - Grab TV listings from teletext through a TV card


SYNOPSIS

tv_grab_ttx [options] [file]


DESCRIPTION

This EPG grabber extracts TV programme listings in XMLTV format from teletext pages as broadcast by most European TV networks. The grabber works by collecting teletext pages through a TV card, locating pages that hold programme schedule tables, and then scraping start times, titles and attributes from them. Additionally, the grabber follows references in the overviews to further teletext pages in order to extract description texts. The grabber must be started separately for each TV channel from which EPG data is to be collected.

Teletext data is captured directly from /dev/vbi0 or the device given by the -dev command line parameter, unless an input file is named on the command line. In the latter case, the grabber expects a stream of pre-processed VBI packets as generated by option -dumpvbi. Use file name - to read from standard input (i.e. from a pipe).

The XMLTV output is written to standard output unless redirected to a file with the -outfile option. There are also ``dump'' options which write each VBI packet, or the assembled teletext input pages, in raw or clear text format. See OPTIONS for details.

The EPG output generated by this grabber is compatible with all applications that can process XML files adhering to XMLTV DTD version 0.5. See http://xmltv.org/ for a list of applications and a copy of the DTD specification. In particular, the output can be imported into nxtvepg(1) and merged with Nextview EPG.


OPTIONS

-page NNN-MMM

This limits the range of the teletext page numbers which are extracted from the input stream. At the same time this defines the range of pages which are scanned for programme schedule tables. The default page range is 300 to 399.

-outfile path

This option can be used to redirect the XMLTV output into a file. By default the output is written to standard out.

-chn_name name

This option can be used to set the value of the display-name tag in the channel table of the generated XMLTV output. By default the name is extracted from teletext page headers.

-chn_id id

This option can be used to set the value of the channel id attribute in the channel table of the generated XMLTV output.

By default the identifier is derived from the canonical network identifier (CNI) which is broadcast as part of VPS and PDC. The grabber uses an internal table to map these numerical values to strings in the form suggested by RFC 2838 (e.g. CNI 1DC1 is mapped to ard.de).
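The mapping can be pictured as a simple table lookup. The following Python sketch is illustrative only: the single table entry is the example given above, while the function name and the fallback format for unknown CNI values are assumptions and do not describe the grabber's actual behaviour.

```python
# Illustrative CNI-to-channel-id lookup; not the grabber's internal table.
CNI_TO_ID = {
    0x1DC1: "ard.de",  # the example from the text above
}

def channel_id_for_cni(cni):
    """Return an RFC 2838 style identifier for a 16-bit CNI value."""
    try:
        return CNI_TO_ID[cni]
    except KeyError:
        # Assumption: this fallback format for unknown CNIs is hypothetical.
        return "CNI%04X" % cni
```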

-merge file

This option can be used to merge the newly grabbed data with previously grabbed data, or with data grabbed from other channels.

Caution: This option can only merge input files which have been written by the same version of the grabber. This is because the grabber does not include a complete XMLTV parser, so the input is expected in the exact format in which the grabber has written it.

-expire minutes

This option can be used to omit programmes which ended more than the given number of minutes ago. Note that for programmes whose stop time is unknown, the start time plus 120 minutes is used instead. By default the expire time is 120 minutes.

This option is particularly intended for use together with the -merge option.
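The expiry rule described above can be sketched as follows. This is a minimal Python illustration, not the grabber's code; the function and parameter names are hypothetical, and times are assumed to be available as datetime objects.

```python
from datetime import datetime, timedelta

# Substitute duration used when a programme's stop time is unknown.
DEFAULT_DURATION = timedelta(minutes=120)

def is_expired(start, stop, now, expire_minutes=120):
    """Return True if the programme ended more than expire_minutes before now."""
    effective_stop = stop if stop is not None else start + DEFAULT_DURATION
    return now - effective_stop > timedelta(minutes=expire_minutes)
```

For example, with the default of 120 minutes, a programme that started five hours ago with an unknown stop time is treated as having ended three hours ago and is therefore dropped.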

-dumpvbi

This option can be used to omit almost all processing after digitization (``slicing'') of VBI data and write incoming teletext and VPS packets to the output. The output can later be read by the grabber as an input file.

In the output, each data record consists of 46 bytes: the first two contain the magazine and page number (0x000..0x7FF), the next three the header control bits including the sub-page number, the next byte the packet number (0 for page header, 1..29 for teletext packets, 30 for PDC/NI packets, 32 for VPS), and the last 40 bytes the payload data (only the 16-bit CNI value in case of VPS/PDC/NI). Note that the byte order of 16-bit values is platform-dependent.
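The record layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the three control bytes are returned undecoded because their bit layout is not specified here, and the file is assumed to be read on a machine with the same byte order as the one that wrote it.

```python
import struct

# Native byte order ("="), no padding: 2 + 3 + 1 + 40 = 46 bytes.
RECORD = struct.Struct("=H3sB40s")

def read_records(stream):
    """Yield (page, ctrl, packet_no, payload) tuples from a binary stream."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of input (a trailing partial record is ignored)
        yield RECORD.unpack(chunk)
```

Such a reader could be applied to a file opened in binary mode, for example to count how many packets were captured per page.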

-dump

This option can be used to omit grabbing and instead print all teletext pages received in the input stream as text in Latin-1 encoding. Note that control characters and characters which cannot be represented in Latin-1 are replaced with spaces. This option is intended for debugging purposes only.

-dumpraw NNN-MMM

This option can be used to disable grabbing and instead write all teletext pages in the given page range to the output stream in a specific format which can later be imported again via option -verify. The output includes the raw teletext packets for each page including all control characters. Additionally, the output contains VPS/PDC packets and a timestamp which indicates the capture time. (The latter is required when parsing relative time and date specifications.)

-verify

This option can be used to read input data in the format written by option -dumpraw instead of the standard VBI packet stream. As the name indicates, this mode is used in regression testing to compare the grabber output for pre-defined input page sequences with stored output files.

-dev path

This option can be used to specify the VBI device from which teletext is captured. By default /dev/vbi0 is used.

-dvbpid PID

This option enables data acquisition from a DVB device and specifies the PID, i.e. the identifier of the data stream which contains teletext data. The PID must be determined by external means. As with analog sources, the TV channel must also be tuned with an external application.

-duration seconds

This option can be used to limit capturing from the VBI device to the given number of seconds. EPG grabbing will start afterwards. If this option is not present, capturing stops automatically after a sufficient number of pages have been read. This option has no effect when reading VBI data from a file.

-verbose

This option can be used to enable output of the number of each teletext page when capturing from a VBI device. This allows monitoring the capture progress.

-debug

This option can be used to enable output of debugging information to standard output. You can use this to get additional diagnostic messages in case of trouble with the capture device. (Note that most of these messages originate directly from the ``ZVBI'' library libzvbi.) Additionally, this option enables messages while parsing teletext pages; these are probably only helpful for developers.

-version

This option prints version and license information, then exits.

-help

This option prints the command line syntax, then exits.


GRABBING PROCESS

This chapter describes the process in which EPG information is extracted from teletext pages. This information is not needed to use the grabber but may help you to understand the goals and limitations. The main challenge of the grabber is that source data is not formatted consistently. Formats used for tables, dates and other attributes may vary widely between networks and may also vary over time or even between pages in a cycle for a specific network. So far the grabber avoids any network-specific hacks and instead attempts to address the problem algorithmically.

Before starting the actual grabbing, all teletext packets are read from the input file or stream into memory and associated with teletext page numbers. In case the same packet is received more than once, the one with the lower parity error count is selected.
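The selection among duplicate packets can be sketched as below. The dictionary layout, the key and the function name are assumptions made for illustration; only the rule "keep the copy with the lower parity error count" is taken from the text.

```python
def store_packet(pages, page, sub, packet_no, data, parity_errors):
    """Keep the copy of each packet that has the fewest parity errors."""
    key = (page, sub, packet_no)
    previous = pages.get(key)
    if previous is None or parity_errors < previous[0]:
        pages[key] = (parity_errors, data)
```

In this sketch a later copy with an equal error count does not replace the earlier one.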

The first step of grabbing is identification of the overview pages which contain tables with start times and programme titles. This is done by searching all pages for lines starting with a time value, but allowing for exceptions such as markers or hidden VPS labels:

   20.00  Tagesschau UT ............ 310
 ! 20.15  Die Stein (10/13) UT ..... 324
          Zwischen Baum und Borke
          Fernsehserie, D 2008
 ! 21.05  In aller Freundschaft UT . 326
          Grossmuetiges Herz (D, 2008)
 ! 21.50  Plusminus UT ............. 328

Of course there may be unrelated rows which also start with a time value. Hence the grabber builds statistics about the number of occurrences of lines where the time and title start in the same column. In the end, the format found most often is selected as the one used for detecting time tables.
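The column statistics can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the regular expression accepts only an optional one-character marker before the time and does not handle hidden VPS labels, and the function name is hypothetical.

```python
import re
from collections import Counter

# Optional one-character marker, then a time such as 20.15, then the title.
TIME_LINE = re.compile(r"^ *(?:[^\s\d] +)?(\d\d?[.:]\d\d) +(\S)")

def detect_table_format(lines):
    """Return the most frequent (time column, title column) pair, or None."""
    counts = Counter()
    for line in lines:
        m = TIME_LINE.match(line)
        if m:
            counts[(m.start(1), m.start(2))] += 1
    return counts.most_common(1)[0][0] if counts else None

page = [
    "   20.00  Tagesschau UT ............ 310",
    " ! 20.15  Die Stein (10/13) UT ..... 324",
    "          Zwischen Baum und Borke",
    " ! 21.05  In aller Freundschaft UT . 326",
    "12.30 Uhr: unrelated row starting with a time",
]
print(detect_table_format(page))
```

The unrelated last row is outvoted because its time and title columns differ from those of the three table rows.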

In the second step, the grabber revisits all pages and extracts start times and titles from all lines which are formatted in the way selected in step 1 above. Additionally, the title lines are broken apart into time, title, episode title, feature attributes (stereo, sub-titles etc.) and description page references. Afterwards stop times are calculated by taking the start time of each subsequent programme as the stop time of the previous one.
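The stop time calculation amounts to pairing each programme with its successor. A minimal Python sketch follows, assuming programmes are held as dictionaries with a "start" key in broadcast order; this representation and the function name are hypothetical.

```python
def assign_stop_times(programmes):
    """Set each programme's stop time to the start of the one that follows."""
    for current, following in zip(programmes, programmes[1:]):
        current["stop"] = following["start"]
    if programmes:
        programmes[-1].setdefault("stop", None)  # last stop time stays unknown
    return programmes
```

The last programme in the list has no successor, which is why its stop time remains unknown in this sketch.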

Finally the date is derived. In some cases the date is printed in full at the top of the page, but often it is given only as ``Today'', ``Tomorrow'' etc. or as a weekday name. To make matters more difficult, a day on these pages is sometimes considered to last until approximately 6:00 on the next day, so ``today'' found after midnight may actually mean ``yesterday''. The latter is resolved by comparing dates between adjacent pages.

Dates are understood in many different formats. Examples:

  Heute
  29.04.06
  Sa 29.04.06
  Sa,29.04.
  Samstag, 29.April
  Sonnabend, 29.04.06
  29.04.2006
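A few of the formats listed above can be recognized with regular expressions, as in the following Python sketch. It is illustrative and not the grabber's parser: the function name is hypothetical, bare weekday names are not handled, a missing year is assumed to be the capture year, and two-digit years are assumed to lie in the 2000s.

```python
import re
from datetime import date, timedelta

RELATIVE = {"heute": 0, "morgen": 1}  # today, tomorrow
MONTHS = ["januar", "februar", "märz", "april", "mai", "juni", "juli",
          "august", "september", "oktober", "november", "dezember"]

NUMERIC = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4}|\d{2})?(?!\d)")
NAMED = re.compile(r"(\d{1,2})\. ?([A-Za-zä]+)")

def parse_date(text, today):
    """Return the date named in text, using today to fill in missing parts."""
    word = text.strip().lower()
    if word in RELATIVE:
        return today + timedelta(days=RELATIVE[word])
    m = NUMERIC.search(text)
    if m:
        day, month = int(m.group(1)), int(m.group(2))
        year = today.year if m.group(3) is None else int(m.group(3))
        if year < 100:
            year += 2000  # assumption: two-digit years are in this century
        return date(year, month, day)
    m = NAMED.search(text)
    if m and m.group(2).lower() in MONTHS:
        month = MONTHS.index(m.group(2).lower()) + 1
        return date(today.year, month, int(m.group(1)))
    return None
```

The second argument stands in for the capture date, which is needed to resolve relative and incomplete dates.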

In the third step, descriptions are grabbed from all pages which were referenced by the overview tables. Often, different sub-pages of the same page are used for descriptions belonging to different programmes. Hence the grabber must match the referencing programme's start time and title against the content of the description page to identify the correct sub-page. This becomes difficult when repetitions of an episode refer to the same description page (most often without listing only the original air date). Example:

  Relic Hunter - Die Schatzjägerin
  Der magische Handschuh
  
  Sa 06.05 18:01-18:59           (vorbei)
  So 07.05 05:10-05:57           (vorbei)

In the final step, the channel name and identifier are derived from teletext page headers and channel identification codes. Once more, statistics are used to filter out transmission errors. Since the teletext page header usually contains additional text after the actual network name (e.g. ``ARDtext Sa 10.06.06'' instead of ``ARD''), the name is derived by stripping all trailing text which matches known time/date formats or other common suffixes.
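The stripping of trailing text can be sketched in Python as follows. The pattern list is an assumption chosen to cover the example above (including treating ``text'' as a common suffix); the patterns actually used by the grabber are not documented here, and the function name is hypothetical.

```python
import re

# Assumed suffix patterns; the grabber's actual list is not documented here.
TRAILING = [
    re.compile(r"\s*\d{1,2}:\d{2}(:\d{2})?$"),            # time, e.g. 20:15:00
    re.compile(r"\s*\d{1,2}\.\d{1,2}\.(\d{2}|\d{4})?$"),  # date, e.g. 10.06.06
    re.compile(r"\s*\b(Mo|Di|Mi|Do|Fr|Sa|So)\.?$"),       # weekday abbreviation
    re.compile(r"[\s-]*(tele)?text$", re.IGNORECASE),     # "text" suffix
]

def strip_channel_name(header):
    """Repeatedly strip known trailing patterns from a page header text."""
    name = header.strip()
    changed = True
    while changed:
        changed = False
        for pattern in TRAILING:
            stripped = pattern.sub("", name)
            if stripped != name and stripped:
                name, changed = stripped, True
    return name
```

The loop repeats until no pattern matches any more, so suffixes are removed regardless of the order in which they appear; a match that would leave an empty name is skipped.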

The data collected in all the above steps is then formatted in XMLTV, optionally merged with an input XMLTV file, and printed to the output file.


FILES

/dev/vbi0, /dev/v4l/vbi0

Device files from which teletext data is read during acquisition when using an analog TV card on Linux. Different paths can be selected with the -dev command line option. Depending on your Linux version, the device files may be located directly beneath /dev or inside /dev/v4l. Other operating systems may use different names.

/dev/dvb/adapter0/demux0

Device file from which teletext data is read during acquisition when using a DVB device, i.e. when the -dvbpid option is present. Different paths can be selected with the -dev command line option. (If you have multiple DVB cards, increment the device index after adapter to select the second card etc.)


SEE ALSO

xmltv(5), tv_cat(1), nxtvepg(1), v4lctl(1), zvbid(1), alevt(1), the Video::ZVBI(3pm) manpage, perl(1)


AUTHOR

Written by Tom Zoerner (tomzo at users.sourceforge.net) since 2006 as part of the nxtvepg project.

The official homepage is http://nxtvepg.sourceforge.net/tv_grab_ttx


COPYRIGHT

Copyright 2006-2008 Tom Zoerner.

This is free software. You may redistribute copies of it under the terms of the GNU General Public License version 3 or later http://www.gnu.org/licenses/gpl.html. There is no warranty, to the extent permitted by law.