Teletext EPG grabber manual

This is the manual page for release 2.3 of the tv_grab_ttx software package.

NAME

tv_grab_ttx - Grab TV listings from teletext through a TV card

SYNOPSIS

tv_grab_ttx [options] [file-or-device]

DESCRIPTION

This EPG grabber extracts TV programme listings in XMLTV format from teletext pages as broadcast by most European TV networks. The grabber works by collecting teletext pages through a TV card, locating pages that hold programme schedule tables, and then scraping start times, titles and attributes from them. Additionally, the grabber follows references from the overviews to further teletext pages to extract description texts. The grabber needs to be started separately for each TV channel from which EPG data is to be collected.

Teletext data is captured directly from /dev/dvb/adapter0/demux0 or another device given as a command line parameter. Alternatively, a file containing previously "dumped" VBI or raw teletext page data can be specified as the input source. Use file name - to read from standard input (i.e. from a pipe).

The XMLTV output is written to standard out unless redirected to a file by use of the -outfile option. There are also "dump" options which write each VBI packet or the assembled teletext input pages in raw or clear text format. See "OPTIONS" for details.

The EPG output generated by this grabber is compatible with all applications that can process XML files adhering to XMLTV DTD version 0.5. See http://xmltv.org/ for a list of applications and a copy of the DTD specification. In particular, the output can be imported into nxtvepg(1) and merged with Nextview EPG.

OPTIONS

-dev path

This option can be used to specify the device from which teletext is captured. By default /dev/dvb/adapter0/demux0 is used. Note the device can also be specified in place of an input file (i.e. without a preceding -dev, if it is the last word on the command line).

-dvbpid PID

This option implies that the input device is Digital TV and specifies the PID of the data stream which contains teletext data. Note this parameter is required for DVB devices, else no data will be captured.

The teletext PID value can often be derived from the PID for video in channels.conf by adding an offset in the range 3 to 30. Alternatively you can look up the PID via Internet services such as https://www.satindex.de/.

-duration seconds

This option can be used to limit capturing from the VBI device to the given number of seconds. EPG grabbing will start afterwards.

If this option is not present, capturing stops automatically after reading approximately 4 complete page cycles. This option has no effect when reading VBI data from a file.

-page NNN-MMM

This option specifies the range of the teletext page numbers in which the grabber searches for overview time tables. (Note the option does not limit the page range for description pages referenced by overviews.) The default page range is 301 to 399.

-outfile path

This option can be used to redirect the XMLTV output into a file. By default the output is written to standard out.

-chn_name name

This option can be used to set the value of the display-name tag in the channel table of the generated XMLTV output. By default the name is extracted from teletext page headers.

-chn_id id

This option can be used to set the value of the channel id attribute in the channel table of the generated XMLTV output.

By default the identifier is derived from the canonical network identifier (CNI) which is broadcast as part of VPS and PDC. The grabber uses an internal table to map these numerical values to strings in the form suggested by RFC2838 (e.g. CNI 1DC1 is mapped to ard.de).

-merge file

This option can be used to merge the newly grabbed data with previously grabbed data, or with data grabbed from other channels.

Caution: This option can only merge input files which have been written by the same version of the grabber. This is because the grabber does not include a complete XMLTV parser, so the input is expected in the exact format in which the grabber has written it.

-expire minutes

This option can be used to omit programmes which ended more than the given number of minutes ago. For programmes whose stop time is unknown, the start time plus 120 minutes is used instead. By default the expire time is 120 minutes.

This option is particularly intended for use together with the -merge option.
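The expiry rule can be illustrated with a minimal sketch. This is not the grabber's actual code; the function name is_expired and its parameters are hypothetical, and only the two 120-minute values are taken from the text above.

```python
from datetime import datetime, timedelta

# Assumed length of a programme whose stop time is unknown
# (120 minutes, as stated for -expire).
FALLBACK_LENGTH = timedelta(minutes=120)


def is_expired(start, stop, now, expire_minutes=120):
    """Return True if the programme ended more than expire_minutes before now.

    start and now are datetime objects; stop is a datetime or None.
    """
    effective_stop = stop if stop is not None else start + FALLBACK_LENGTH
    return now - effective_stop > timedelta(minutes=expire_minutes)
```

With the defaults, a programme without stop time is therefore dropped four hours after its start.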

-dumpvbi

This option can be used to omit almost all processing after digitization ("slicing") of VBI data and write incoming teletext and VPS packets to the output. The output can later be read by the grabber as input file.

In the output, each data record consists of 46 bytes: the first two contain the magazine and page number (0x000..0x7FF), the next three the header control bits including the sub-page number, the next byte the packet number (0 for page header, 1..29 for teletext packets, 30 for PDC/NI packets, 32 for VPS), and the last 40 bytes the payload data (only the 16-bit CNI value in case of VPS/PDC/NI). Note the byte order of 16-bit values is platform-dependent.
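As an illustration of the record layout, the following sketch splits a dump into 46-byte records. It assumes the dump is read on the same platform that wrote it (native byte order) and treats the three control bytes as opaque, since their bit layout is not specified here. The names parse_record and read_records are hypothetical.

```python
import struct

RECORD_SIZE = 46
# "=" selects native byte order without padding: a 16-bit page number,
# 3 control bytes, 1 byte packet number, 40 bytes payload.
RECORD_FORMAT = "=H3sB40s"


def parse_record(record):
    """Split one 46-byte record into its four fields."""
    if len(record) != RECORD_SIZE:
        raise ValueError("expected a record of %d bytes" % RECORD_SIZE)
    page, ctrl, packet, payload = struct.unpack(RECORD_FORMAT, record)
    return {"page": page, "ctrl": ctrl, "packet": packet, "payload": payload}


def read_records(stream):
    """Yield parsed records from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(RECORD_SIZE)
        if len(chunk) < RECORD_SIZE:
            break  # end of input; a trailing partial record is ignored
        yield parse_record(chunk)
```

A file written by -dumpvbi could then be inspected with read_records(open(path, "rb")).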

-dumptext

This option can be used to omit grabbing and instead print all teletext pages received in the input stream as text in Latin-1 encoding. Note control characters and characters which cannot be represented in Latin-1 are replaced with spaces. This option is intended for debugging purposes only.

-dumpraw NNN-MMM

This option can be used to disable grabbing and instead write all teletext pages in the given page range to the output stream in a specific format which can later be imported again by specifying the file as input file on the command line.

The output includes the raw teletext packets for each page including all control characters. Additionally, the output contains a channel identification code and a timestamp which indicates the capture time. (These are required for generating a channel name and for parsing relative time and date specifications.)

-verify

This option is only intended for use in verifying the teletext grabber functionality against previously captured reference input and output data. As such the option is not of interest for the general user.

-verbose

This option can be used for enabling output of the number of each received page while capturing teletext. This allows monitoring capture progress. Output is written to the stderr stream, so it will not corrupt XMLTV output when redirecting stdout into a file.

-pgstat

This option can be used for enabling output of page reception statistics while capturing teletext. This allows monitoring capture progress. Output is written to the stderr stream, so it will not corrupt XMLTV output when redirecting stdout into a file.

-debug

This option can be used to enable output of debugging information to standard output. You can use this to get additional diagnostic messages in case of trouble with the capture device. (Note most of these messages originate directly from the "ZVBI" library libzvbi.)

Additionally this option enables messages during parsing teletext pages; these are probably only helpful for developers. Note many of these messages go to stdout and thus will corrupt the generated XMLTV output when using redirection into a file.

-version

This option prints version and license information, then exits.

-help

This option prints the command line syntax, then exits.

After all options a file name can be specified. The file can either refer to a VBI device, or a file generated by -dumpvbi or -dumpraw. The parser automatically detects the type of the input file.
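When the grabber is driven from a script, the command line can be assembled from the options described above. The following sketch only builds the argument list; the helper name build_command is hypothetical, and the values in the usage note are placeholders rather than recommendations.

```python
def build_command(device=None, dvbpid=None, duration=None,
                  page_range=None, outfile=None):
    """Return an argument list for tv_grab_ttx using documented options only."""
    cmd = ["tv_grab_ttx"]
    if dvbpid is not None:
        cmd += ["-dvbpid", str(dvbpid)]
    if duration is not None:
        cmd += ["-duration", str(duration)]
    if page_range is not None:
        cmd += ["-page", page_range]      # format NNN-MMM
    if outfile is not None:
        cmd += ["-outfile", outfile]
    if device is not None:
        cmd.append(device)                # input file or device goes last
    return cmd
```

For example, build_command(device="/dev/dvb/adapter0/demux0", dvbpid=104, duration=90, outfile="ard.xml") yields a list that can be passed to subprocess.run(); the PID 104 is a made-up value and has to be replaced by the teletext PID of the tuned channel.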

GRABBING PROCESS

This chapter describes the process in which EPG information is extracted from teletext pages. This information is not needed to use the grabber but may help you to understand the goals and limitations. The main challenge of the grabber is that source data is not formatted consistently. Formats used for tables, dates and other attributes may vary widely between networks and may also vary over time or even between pages in a cycle for a specific network. So far the grabber avoids any network-specific hacks and instead attempts to address the problem algorithmically.

Before starting the actual grabbing, all teletext packets are read from the input file or stream into memory and associated with teletext page numbers. In case the same packet is received more than once, the one with the lower parity error count is selected.
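The selection between repeated packets can be sketched as follows. This is an illustration only: it assumes every payload byte is a text byte protected by odd parity (which does not hold for Hamming-coded header fields), and the names parity_errors and store_packet are hypothetical.

```python
def parity_errors(packet):
    """Count bytes with an even number of set bits, i.e. odd-parity violations."""
    return sum(1 for byte in packet if bin(byte).count("1") % 2 == 0)


def store_packet(pages, page, row, packet):
    """Keep the copy of (page, row) that has the fewest parity errors."""
    key = (page, row)
    current = pages.get(key)
    if current is None or parity_errors(packet) < parity_errors(current):
        pages[key] = packet
```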

If a network is using sub-pages, the capturing duration has to be long enough to cover all cycles, else there may be gaps in the overview or description pages may be missing. Many networks however use sub-pages which only differ in advertisements; in this case a single cycle is enough. Currently the program does not detect this case automatically, hence the duration is usually configured as a constant, such as 90 seconds per channel.

The first step of grabbing is identification of the overview pages which contain tables with start times and programme titles. This is done by searching all pages for lines starting with a time value, but allowing for exceptions such as markers or hidden VPS labels:

   20.00  Tagesschau UT ............ 310
 ! 20.15  Die Stein (10/13) UT ..... 324
          Zwischen Baum und Borke
          Fernsehserie, D 2008
 ! 21.05  In aller Freundschaft UT . 326
          Grossmütiges Herz (D, 2008)
 ! 21.50  Plusminus UT ............. 328

Of course there may be unrelated rows which also start with a time value. Hence the grabber builds statistics about the number of occurrences of lines where the time and title start in the same column. In the end, the format found most often is selected as the one used for detecting time tables.
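The statistics of this first step can be approximated by the following sketch. It is a simplification under stated assumptions: the regular expression, the set of marker characters and the function names are hypothetical, and a "format" is reduced to the pair of columns in which time and title start.

```python
import re
from collections import Counter

# Optional marker such as "!", a time like "20.15" or "20:15", then the title.
LINE_RE = re.compile(r"^(\s*[!*]?\s*)(\d{1,2}[.:]\d{2})(\s+)(\S)")


def line_format(line):
    """Return (time column, title column), or None if no time value leads the line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return (m.start(2), m.start(4))


def most_common_format(pages):
    """Return the column pair found most often across all lines of all pages."""
    counts = Counter()
    for page in pages:
        for line in page:
            fmt = line_format(line)
            if fmt is not None:
                counts[fmt] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Lines matching a different column pair, such as a news headline that happens to start with a time, are outvoted by the regular table rows.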

In the second step, the grabber revisits all pages and extracts start times and titles from all lines which are formatted in the way selected in step 1 above. Additionally, the title lines are broken apart into time, title, episode title and feature attributes (stereo, sub-titles etc.), and references to description pages are extracted. Afterwards stop times are calculated by taking the start time of each subsequent programme as the stop time of the previous one, unless a specific end time is given at the end of a page.
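The stop time calculation can be sketched as below. The data layout (a list of dictionaries sorted by start time) and the name fill_stop_times are assumptions made for illustration; the explicit end time printed at the end of a page is passed in as page_end.

```python
def fill_stop_times(programmes, page_end=None):
    """Set each programme's stop time to the start time of its successor.

    programmes is a list of dicts with a "start" entry, sorted by start time.
    The last programme gets page_end, which stays None if the page gives no
    explicit end time.
    """
    for current, following in zip(programmes, programmes[1:]):
        current["stop"] = following["start"]
    if programmes:
        programmes[-1]["stop"] = page_end
    return programmes
```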

Finally the date is derived. In some cases the date is printed unambiguously at the top of the page, but often it is just given as "Today", "Tomorrow" etc. or a weekday name. Even more difficult, a day on these pages is sometimes considered to span until approximately 6:00 on the next day, so finding "today" after midnight may actually mean "yesterday". The latter is resolved by comparing dates between adjacent pages.

Dates are understood in many different formats. Examples:

  Heute
  Sa 06-10
  29.04.06
  Sa 29.04.06
  Sa,29.04.
  Samstag, 29.April
  Sonnabend, 29.04.06
  29.04.2006
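A reduced sketch of date parsing for some of these formats follows. It only covers numeric day.month forms and the relative words "Heute" and "Morgen"; month names, the "Sa 06-10" form and the after-midnight correction are deliberately left out. The function name parse_date and the rule of taking a missing year from the capture date are assumptions.

```python
import re
from datetime import date, timedelta

# Day and month separated by dots, optionally followed by a 2- or 4-digit year.
DATE_RE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4}|\d{2})?")
RELATIVE = {"heute": 0, "morgen": 1}


def parse_date(text, capture_date):
    """Return a date for the given page text, or None if it is not understood."""
    word = text.strip().lower()
    if word in RELATIVE:
        return capture_date + timedelta(days=RELATIVE[word])
    m = DATE_RE.search(text)
    if m is None:
        return None
    day, month = int(m.group(1)), int(m.group(2))
    if m.group(3) is None:
        year = capture_date.year          # assumption: same year as capture
    else:
        year = int(m.group(3))
        if year < 100:
            year += 2000                  # assumption: two-digit years are 20xx
    return date(year, month, day)
```

This is why the capture timestamp has to be available: relative and year-less dates can only be resolved against it.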

In the third step, descriptions are grabbed from all pages which were referenced by the overview tables. Often, different sub-pages of the same page are used for descriptions belonging to different programmes. Hence the grabber must match the referencing programme's start time and title against the content of the description page to identify the correct sub-page. This gets difficult when repetitions of an episode refer to the same description page (most often listing only the original air date). Example:

  Relic Hunter - Die Schatzjägerin
  Der magische Handschuh

  Sa 06.05 18:01-18:59           (vorbei)
  So 07.05 05:10-05:57           (vorbei)

This step also attempts to improve the programme title by correlating the position of line-breaks between overview and description pages. Additional feature attributes may also be extracted here.

If a description page looks like a formatted table, the page is reformatted into a comma-separated list. This is done because white-space and line-breaks are not preserved by the grabber and most XMLTV browsers display the text with proportional fonts, so that the table columns would not be aligned in any case.

  Darsteller:

  Jake Wade ............ Robert Taylor
  Clint Hollister ...... Richard Widmark

The above will be converted into:

  Darsteller:
  Jake Wade: Robert Taylor, Clint Hollister: Richard Widmark
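The conversion of such table rows can be sketched as follows. The regular expression and the name flatten_table are hypothetical; the sketch assumes that columns are separated by a run of at least two dots and does not handle heading lines such as "Darsteller:".

```python
import re

# Left column, a run of two or more dots, right column.
ROW_RE = re.compile(r"^\s*(.+?)\s*\.{2,}\s*(.+?)\s*$")


def flatten_table(lines):
    """Join dotted table rows into a single comma-separated string."""
    pairs = []
    for line in lines:
        m = ROW_RE.match(line)
        if m:
            pairs.append("%s: %s" % (m.group(1), m.group(2)))
    return ", ".join(pairs)
```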

In the fourth step, the channel name and identifier are derived from teletext page headers and channel identification codes. Once more statistics are used to filter out transmission errors. Since the teletext page header usually contains additional text after the actual network name (e.g. "ARDtext Sa 10.06.06" instead of "ARD") the name is derived by stripping all appendices which match known time/date formats or other common postfixes.
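How a name could be derived from page headers is shown in the sketch below. Both suffix patterns are hypothetical stand-ins for the grabber's list of known date formats and postfixes, and the majority vote merely mirrors the statement that statistics are used to filter out transmission errors.

```python
import re
from collections import Counter

# Hypothetical suffixes: a trailing date with optional weekday abbreviation,
# and a trailing "text" postfix.
DATE_SUFFIX = re.compile(
    r"\s+(?:[A-Za-z]{2}[ ,.]*)?\d{1,2}\.\d{1,2}\.(?:\d{2,4})?\s*$")
TEXT_SUFFIX = re.compile(r"[\s-]*text$", re.IGNORECASE)


def clean_header(header):
    """Strip date and "text" suffixes from one page header."""
    name = DATE_SUFFIX.sub("", header.strip())
    name = TEXT_SUFFIX.sub("", name)
    return name.strip()


def channel_name(headers):
    """Return the cleaned name that occurs most often among the headers."""
    counts = Counter(clean_header(h) for h in headers)
    return counts.most_common(1)[0][0] if counts else None
```

A header damaged by a transmission error yields a different cleaned name and is outvoted by the intact copies.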

Finally, the data collected in all the above steps is optionally merged with an XMLTV file given by the "-merge" command line option. Then all the data is formatted in XMLTV format and printed to the output file.

FILES

/dev/vbi0

Device file from which teletext can be captured when using an analog TV card on Linux. Different paths can be selected with the -dev command line option. Depending on your Linux version, the device files may be located directly beneath /dev or inside /dev/v4l. Other operating systems such as BSD may use different names.

/dev/dvb/adapter0/demux0

Device file from which teletext can be captured when using a Digital TV device. Different paths can be selected with the -dev command line option. (If you have multiple TV cards, increment the device index after adapter to get the second card etc.)

SEE ALSO

xmltv(5), tv_cat(1), nxtvepg(1), v4lctl(1), zvbid(1), alevt(1).

AUTHOR

The application was developed by T. Zoerner (tomzo at users.sourceforge.net) between 2006 and 2011 as part of the nxtvepg project (http://nxtvepg.sourceforge.net). The official homepage is https://github.com/tomzox/tv_grab_ttx.

The application was initially implemented in Perl, using the specially developed Video::ZVBI Perl extension module (https://metacpan.org/pod/Video::ZVBI) for capturing teletext. In 2010 it was translated to C++, relying heavily on the regex regular expression library of the "Boost" project (https://www.boost.org/). The latter was replaced in 2020 by std::regex, which has been part of C++ since C++11.

The "ZVBI" library (http://zapping.sourceforge.net/ZVBI/index.html) is used for capturing and decoding teletext.

COPYRIGHT

Copyright 2006-2011, 2020-2021 T. Zoerner.

This is free software. You may redistribute copies of it under the terms of the GNU General Public License version 3 or later http://www.gnu.org/licenses/gpl.html. There is no warranty, to the extent permitted by law.