Teletext EPG grabber manual
This is the manual page for release 2.3 of the tv_grab_ttx software package.
NAME
tv_grab_ttx - Grab TV listings from teletext through a TV card
SYNOPSIS
tv_grab_ttx [options] [file-or-device]
DESCRIPTION
This EPG grabber extracts TV programme listings in XMLTV format from teletext pages as broadcast by most European TV networks. The grabber works by collecting teletext pages through a TV card, locating pages holding programme schedule tables and then scraping starting times, titles and attributes from those. Additionally, the grabber follows references to further teletext pages in the overviews to extract description texts. The grabber needs to be started separately for each TV channel from which EPG data is to be collected.
Teletext data is captured directly from /dev/dvb/adapter0/demux0 or another device given as a command line parameter. Alternatively, a file containing previously "dumped" VBI or raw teletext page data can be specified as the input source. Use file name - to read from standard input (i.e. from a pipe).
The XMLTV output is written to standard out unless redirected to a file by use of the -outfile option. There are also "dump" options which write each VBI packet or assembled teletext input pages in raw or clear text format. See "OPTIONS" for details.
The EPG output generated by this grabber is compatible with all applications which can process XML files adhering to XMLTV DTD version 0.5. See http://xmltv.org/ for a list of applications and a copy of the DTD specification. In particular, the output can be imported into nxtvepg(1) and merged with Nextview EPG.
OPTIONS
- -dev path
-
This option can be used to specify the device from which teletext is captured. By default /dev/dvb/adapter0/demux0 is used. Note the device can also be specified in place of an input file (i.e. without a preceding -dev, if it is the last word on the command line).
- -dvbpid PID
-
This option implies that the input device is Digital TV and specifies the PID of the data stream which contains teletext data. Note this parameter is required for DVB devices, else no data will be captured.
The teletext PID value can often be derived from the PID for video in channels.conf by adding offsets in the range 3 to 30. Alternatively you can look up the PID via Internet services such as https://www.satindex.de/.
- -duration seconds
-
This option can be used to limit capturing from the VBI device to the given number of seconds. EPG grabbing will start afterwards.
If this option is not present, capturing stops automatically after reading approximately 4 complete page cycles. This option has no effect when reading VBI data from a file.
- -page NNN-MMM
-
This option specifies the range of the teletext page numbers in which the grabber searches for overview time tables. (Note the option does not limit the page range for description pages referenced by overviews.) The default page range is 301 to 399.
- -outfile path
-
This option can be used to redirect the XMLTV output into a file. By default the output is written to standard out.
- -chn_name name
-
This option can be used to set the value of the display-name tag in the channel table of the generated XMLTV output. By default the name is extracted from teletext page headers.
- -chn_id id
-
This option can be used to set the value of the channel id attribute in the channel table of the generated XMLTV output.
By default the identifier is derived from the canonical network identifier (CNI) which is broadcast as part of VPS and PDC. The grabber uses an internal table to map these numerical values to strings in the form suggested by RFC2838 (e.g. CNI 1DC1 is mapped to ard.de).
- -merge file
-
This option can be used to merge the newly grabbed data with previously grabbed data, or with data grabbed from other channels.
Caution: This option can only merge input files which have been written by the same version of the grabber. This is because the grabber does not include a complete XMLTV parser, so the input is expected in the exact format in which the grabber has written it.
- -expire minutes
-
This option can be used to omit programmes which have completed more than the given number of minutes ago. Note for programmes for which the stop time is unknown the start time plus 120 minutes is used. By default the expire time is 120 minutes.
This option is particularly intended for use together with the -merge option.
- -dumpvbi
-
This option can be used to omit almost all processing after digitization ("slicing") of VBI data and write incoming teletext and VPS packets to the output. The output can later be read by the grabber as input file.
In the output, each data record consists of 46 bytes, of which the first two contain the magazine and page number (0x000..0x7FF), the next three the header control bits including sub-page number, the next byte the packet number (0 for page header, 1..29 for teletext packets, 30 for PDC/NI packets, 32 for VPS) and finally the last 40 bytes the payload data (only the 16-bit CNI value in case of VPS/PDC/NI). Note the byte-order of 16-bit values is platform-dependent.
- -dumptext
-
This option can be used to omit grabbing and instead print all teletext pages received in the input stream as text in Latin-1 encoding. Note control characters and characters which cannot be represented in Latin-1 are replaced with spaces. This option is intended for debugging purposes only.
- -dumpraw NNN-MMM
-
This option can be used to disable grabbing and instead write all teletext pages in the given page range to the output stream in a specific format which can later be imported again by specifying the file as input file on the command line.
The output includes the raw teletext packets for each page including all control characters. Additionally, the output contains a channel identification code and a timestamp which indicates the capture time. (These are required for generating a channel name and for parsing relative time and date specifications.)
- -verify
-
This option is only intended for use in verifying the teletext grabber functionality against previously captured reference input and output data. As such the option is not of interest for the general user.
- -verbose
-
This option can be used for enabling output of the number of each received page while capturing teletext. This allows monitoring capture progress. Output is written to the stderr stream, so it will not corrupt XMLTV output when redirecting stdout into a file.
- -pgstat
-
This option can be used for enabling output of page reception statistics while capturing teletext. This allows monitoring capture progress. Output is written to the stderr stream, so it will not corrupt XMLTV output when redirecting stdout into a file.
- -debug
-
This option can be used to enable output of debugging information to standard output. You can use this to get additional diagnostic messages in case of trouble with the capture device. (Note most of these messages originate directly from the "ZVBI" library libzvbi.)
Additionally this option enables messages while parsing teletext pages; these are probably only helpful for developers. Note many of these messages go to stdout and thus will corrupt the generated XMLTV output when using redirection into a file.
- -version
-
This option prints version and license information, then exits.
- -help
-
This option prints the command line syntax, then exits.
After all options a file name can be specified. The file can either refer to a VBI device, or a file generated by -dumpvbi or -dumpraw. The parser automatically detects the type of the input file.
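As a rough illustration of the 46-byte record layout described for -dumpvbi, the following Python sketch splits a dump into records. The names VbiRecord and split_records are hypothetical, and little-endian byte order is only an assumption, since the manual states that the byte order is platform-dependent. The manual does not specify how the three control bytes are packed, so they are kept as raw bytes here.

```python
import struct
from typing import Iterator, NamedTuple

RECORD_SIZE = 46  # 2 + 3 + 1 + 40 bytes, as described for -dumpvbi


class VbiRecord(NamedTuple):
    page: int       # magazine and page number, 0x000..0x7FF
    ctrl: bytes     # 3 raw bytes: header control bits incl. sub-page number
    packet: int     # 0 = page header, 1..29 = teletext, 30 = PDC/NI, 32 = VPS
    payload: bytes  # 40 bytes of payload data


def split_records(data: bytes, byte_order: str = "<") -> Iterator[VbiRecord]:
    """Yield records from a dump; "<" (little-endian) is an assumption."""
    fmt = byte_order + "H3sB40s"
    assert struct.calcsize(fmt) == RECORD_SIZE
    # Ignore a trailing partial record, if any.
    for offset in range(0, len(data) - RECORD_SIZE + 1, RECORD_SIZE):
        yield VbiRecord(*struct.unpack_from(fmt, data, offset))
```

On a big-endian platform the same function could be called with byte_order=">".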
GRABBING PROCESS
This chapter describes the process in which EPG information is extracted from teletext pages. This information is not needed to use the grabber but may help you to understand the goals and limitations. The main challenge of the grabber is that source data is not formatted consistently. Formats used for tables, dates and other attributes may vary widely between networks and may also vary over time or even between pages in a cycle for a specific network. So far the grabber avoids any network-specific hacks and instead attempts to address the problem algorithmically.
Before starting the actual grabbing, all teletext packets are read from the input file or stream into memory and associated with teletext page numbers. In case the same packet is received more than once, the one with the lower parity error count is selected.
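The selection between duplicate packets can be sketched as follows. This is a simplified illustration, not the grabber's actual code: it assumes every payload byte is an odd-parity text character (as used for teletext text rows), and the names parity_errors and store_packet as well as the dictionary key are hypothetical.

```python
def parity_errors(packet: bytes) -> int:
    """Count bytes that violate odd parity (an even number of set bits)."""
    return sum(1 for byte in packet if bin(byte).count("1") % 2 == 0)


def store_packet(store: dict, key: tuple, packet: bytes) -> None:
    """Keep only the copy of a packet with the fewest parity errors.

    key is assumed to identify a packet, e.g. (page, sub-page, row).
    """
    current = store.get(key)
    if current is None or parity_errors(packet) < parity_errors(current):
        store[key] = packet
```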
If a network is using sub-pages, the capturing duration has to be long enough to cover all cycles, else there may be gaps in the overview or description pages may be missing. Many networks however use sub-pages which only differ in advertisements; in this case a single cycle is enough. Currently the program does not detect this case automatically, hence the duration is usually configured as a constant, such as 90 seconds per channel.
The first step of grabbing is identification of the overview pages which contain tables with start times and programme titles. This is done by searching all pages for lines starting with a time value, but allowing for exceptions such as markers or hidden VPS labels:
20.00 Tagesschau UT ............ 310
! 20.15 Die Stein (10/13) UT ..... 324
Zwischen Baum und Borke
Fernsehserie, D 2008
! 21.05 In aller Freundschaft UT . 326
Grossmütiges Herz (D, 2008)
! 21.50 Plusminus UT ............. 328
Of course there may be unrelated rows which also start with a time value. Hence the grabber builds statistics about the number of occurrences of lines where the time and title start in the same column. In the end, the format found most often is selected as the one used for detecting time tables.
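The column statistics can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the regular expression, the accepted marker characters ("!" and "*") and the function names are hypothetical, and the real grabber handles many more exceptions.

```python
import re
from collections import Counter

# Optional marker, then a time such as "20.15" or "20:15", then the title.
TIME_LINE = re.compile(r"^\s*(?:[!*]\s*)?(\d{1,2})[.:](\d{2})\s+(\S)")


def line_format(line):
    """Return (time column, title column) or None if the line has no time."""
    m = TIME_LINE.match(line)
    if m is None:
        return None
    if int(m.group(1)) > 23 or int(m.group(2)) > 59:
        return None
    return (m.start(1), m.start(3))


def most_common_format(pages):
    """Select the column layout that occurs most often across all pages."""
    stats = Counter()
    for page in pages:
        for line in page:
            fmt = line_format(line)
            if fmt is not None:
                stats[fmt] += 1
    return stats.most_common(1)[0][0] if stats else None
```

Lines whose time and title columns differ from the winning layout would then be ignored in the following step.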
In the second step, the grabber revisits all pages and extracts start times and titles from all lines which are formatted in the way selected in step 1 above. Additionally, the title lines are broken apart into time, title, episode title and feature attributes (stereo, sub-titles etc.), and references to description pages are extracted. Afterwards stop times are calculated by taking the start of the subsequent programme as the stop time of the previous one, unless a specific end time is given at the end of a page.
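The stop time rule can be sketched in a few lines of Python. The function name assign_stop_times and the page_end parameter are hypothetical; the sketch assumes the start times are already sorted and carry a full date.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def assign_stop_times(
    starts: List[datetime], page_end: Optional[datetime] = None
) -> List[Tuple[datetime, Optional[datetime]]]:
    """Pair each start time with the start of the following programme.

    The last programme gets page_end (an explicit end time found at the
    end of the page), or None if no such time is known.
    """
    slots = []
    for i, start in enumerate(starts):
        stop = starts[i + 1] if i + 1 < len(starts) else page_end
        slots.append((start, stop))
    return slots
```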
Finally the date is derived. In some cases the date is printed unambiguously at the top of the page, but often it is just given as "Today", "Tomorrow" etc. or as a weekday name. Even more difficult, a day on these pages is sometimes considered to span until approximately 6:00 on the next day, so finding "today" after midnight may actually mean "yesterday". The latter is resolved by comparing dates between adjacent pages.
Dates are understood in many different formats. Examples:
Heute
Sa 06-10
29.04.06
Sa 29.04.06
Sa,29.04.
Samstag, 29.April
Sonnabend, 29.04.06
29.04.2006
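A minimal Python sketch of how some of these formats could be recognised is given below. It is deliberately incomplete: it only covers "Heute" and the dotted numeric forms, and returns None for month names or forms such as "Sa 06-10". The function name parse_date, the reference parameter (standing in for the capture date) and the handling of two-digit and missing years are assumptions for illustration.

```python
import re
from datetime import date
from typing import Optional

# Day.month. with an optional two- or four-digit year.
NUMERIC_DATE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4}|\d{2})?(?!\d)")


def parse_date(text: str, reference: date) -> Optional[date]:
    """Parse a date heading relative to the reference (capture) date."""
    if text.strip().lower() == "heute":
        return reference
    m = NUMERIC_DATE.search(text)
    if m is None:
        return None
    day, month = int(m.group(1)), int(m.group(2))
    if m.group(3) is None:
        year = reference.year  # assumption: ignores the turn of the year
    else:
        year = int(m.group(3))
        if year < 100:
            year += 2000  # assumption: two-digit years are in the 2000s
    try:
        return date(year, month, day)
    except ValueError:
        return None
```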
In the third step, descriptions are grabbed from all pages which were referenced by the overview tables. Often, different sub-pages of the same page are used for descriptions belonging to different programmes. Hence the grabber must match the referencing programme's start time and title against the content of the description page to identify the correct sub-page. Often this gets difficult when repetitions of an episode refer to the same description page (most often without listing only the original air date). Example:
Relic Hunter - Die Schatzjägerin
Der magische Handschuh
Sa 06.05 18:01-18:59 (vorbei)
So 07.05 05:10-05:57 (vorbei)
This step also attempts to improve the programme title by correlating the position of line-breaks between overview and description pages. Additional feature attributes may also be extracted here.
If a description page looks like a formatted table, the page is reformatted into a comma-separated list. This is done because white-space and line-breaks are not preserved by the grabber and most XMLTV browsers display the text with proportional fonts, so that the table columns would not be aligned in any case.
Darsteller:
Jake Wade ............ Robert Taylor
Clint Hollister ...... Richard Widmark
The above will be converted into:
Darsteller:
Jake Wade: Robert Taylor, Clint Hollister: Richard Widmark
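The conversion shown above can be approximated with the following Python sketch. It assumes table rows use runs of at least two dots as column leaders; the pattern and the function name flatten_table are hypothetical, and detecting whether a page "looks like a table" in the first place is not covered.

```python
import re

# A key, a run of two or more leader dots, then the value.
LEADER = re.compile(r"^(?P<key>.+?)\s*\.{2,}\s*(?P<value>.+)$")


def flatten_table(lines):
    """Turn dotted-leader table rows into one comma-separated line."""
    pairs = []
    for line in lines:
        m = LEADER.match(line.strip())
        if m:
            key = m.group("key").strip()
            value = m.group("value").strip()
            pairs.append(f"{key}: {value}")
    return ", ".join(pairs)
```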
In the fourth step, the channel name and identifier are derived from teletext page headers and channel identification codes. Once more statistics are used to filter out transmission errors. Since the teletext page header usually contains additional text after the actual network name (e.g. "ARDtext Sa 10.06.06" instead of "ARD") the name is derived by stripping all appendices which match known time/date formats or other common postfixes.
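A simplified Python sketch of this name derivation follows. The suffix patterns (German weekday abbreviations, a dotted date, and a trailing "text" or "teletext") and the function names are assumptions chosen to fit the "ARDtext Sa 10.06.06" example; the grabber's real list of formats is more extensive.

```python
import re
from collections import Counter

# Optional weekday abbreviation followed by a dotted date and anything after.
DATE_TIME_SUFFIX = re.compile(
    r"\s+(?:(?:Mo|Di|Mi|Do|Fr|Sa|So)\.?,?\s*)?\d{1,2}\.\d{1,2}\.(?:\d{2,4})?.*$"
)
# Common suffix appended to the network name (assumed list).
TEXT_SUFFIX = re.compile(r"[\s-]*(?:tele)?text$", re.IGNORECASE)


def clean_header(header):
    """Strip date/time and "text" suffixes from one page header."""
    name = DATE_TIME_SUFFIX.sub("", header.strip())
    name = TEXT_SUFFIX.sub("", name)
    return name.strip()


def channel_name(headers):
    """Pick the most frequent cleaned name to filter transmission errors."""
    counts = Counter(clean_header(h) for h in headers)
    counts.pop("", None)
    return counts.most_common(1)[0][0] if counts else None
```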
Finally, the data collected in all the above steps is optionally merged with an XMLTV file given by the "-merge" command line option. Then all the data is formatted in XMLTV format and printed to the output file.
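When merged data is written out, the rule described for the -expire option decides which programmes are omitted. The following Python sketch illustrates that rule only; the tuple layout and the function name drop_expired are hypothetical, and the point in the process at which the grabber applies the rule is not specified by this manual.

```python
from datetime import datetime, timedelta

# Duration assumed for programmes without a known stop time (per -expire).
DEFAULT_DURATION = timedelta(minutes=120)


def drop_expired(programmes, now, expire_minutes=120):
    """Omit programmes which ended more than expire_minutes before now.

    programmes is a list of (start, stop, title); stop may be None.
    """
    limit = now - timedelta(minutes=expire_minutes)
    kept = []
    for start, stop, title in programmes:
        end = stop if stop is not None else start + DEFAULT_DURATION
        if end >= limit:
            kept.append((start, stop, title))
    return kept
```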
FILES
- /dev/vbi0
-
Device file from which teletext can be captured when using an analog TV card on Linux. Different paths can be selected with the -dev command line option. Depending on your Linux version, the device files may be located directly beneath /dev or inside /dev/v4l. Other operating systems such as BSD may use different names.
- /dev/dvb/adapter0/demux0
-
Device file from which teletext can be captured when using a Digital TV device. Different paths can be selected with the -dev command line option. (If you have multiple TV cards, increment the device index after adapter to get the second card etc.)
SEE ALSO
xmltv(5), tv_cat(1), nxtvepg(1), v4lctl(1), zvbid(1), alevt(1).
AUTHOR
The application was developed by T. Zoerner (tomzo at users.sourceforge.net) between 2006 and 2011 as part of the nxtvepg project (http://nxtvepg.sourceforge.net). The official homepage is https://github.com/tomzox/tv_grab_ttx.
The application was initially implemented in Perl, using the specially developed Video::ZVBI Perl extension module (https://metacpan.org/pod/Video::ZVBI) for capturing teletext. In 2010 it was translated to C++, relying heavily on the regex regular expression library of the "Boost" project (https://www.boost.org/). The latter was replaced in 2020 by std::regex, which has been part of C++ since C++11.
The "ZVBI" library (http://zapping.sourceforge.net/ZVBI/index.html) is used for capturing and decoding teletext.
COPYRIGHT
Copyright 2006-2011, 2020-2021 T. Zoerner.
This is free software. You may redistribute copies of it under the terms of the GNU General Public License version 3 or later http://www.gnu.org/licenses/gpl.html. There is no warranty, to the extent permitted by law.