= Technical Details for the Generalitat de Catalunya Site
INTRODUCTION
This file describes the difficulties encountered, and the solutions found, when
extracting the various 'data-sets' from the Generalitat de Catalunya internet site:
http://www.intercat.gencat.es/ This site is designed to help foreign students
learn the Catalan language. It contains a large number (about 4000) of sound files
of Catalan words and phrases being spoken. The sound files are in the RealMedia
(RealPlayer) format. The site also contains the textual equivalents of the sounds being
spoken and translations into 5 European languages.
Having listened to a relatively small percentage of the sound files, I would say that
they are not of the highest quality, although they are probably usable. For example,
in a number of cases I was not able to understand what the person speaking English was
saying, despite English being my native language. (The site also contains sound files of
English words and phrases.)
QUICK CONCLUSIONS
The unix program 'html2text' was used to extract the data from the web site.
At first the text browser 'Lynx' was used, but it destroyed the HTML tables which
the web site uses to pair each sound file's 'textual equivalent' with its
translation into a European language. That is, the site has a two-column table with
the Catalan phrase and the English/ French/ etc phrase in side-by-side HTML table cells.
Then the text browser 'links' was tried. It does support HTML tables, but it mangles
'special' European characters such as 'accented' characters. So 'html2text' was used,
which handles both the tables and the accented characters well.
TECHNICAL STUFF
http://www.intercat.gencat.es/guia/capitol1.htm
This is an example of an index page for the words and phrases. The page
contains a series of hyperlinks, each of which is linked to a sound file.
The hyperlink texts are pairs of equivalent phrases in Catalan and
English (or one of the other translation languages).
The other languages available are French, German, Portuguese and Spanish,
each paired with Catalan.
There appear to be 21 index files (1-21) which are all named as in the command
lines below.
This is the same for the Catalan/ French pages, except that the French pages
appear to be named http://www.intercat.gencat.es/guia/frances/capitol[1-21].html
where the number range in brackets is not typed literally but stands for a
series of pages.
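The naming pattern can be spelled out with a small shell function. This is only a
sketch: 'chapter_urls' is a made-up name, the directory names are the ones used in the
command lines in these notes, and the extension is a parameter because both '.htm' and
'.html' turn up in these notes.

```shell
# chapter_urls: print the URL of each of the 21 index pages for one language.
# $1: language directory ("" for the English pages, otherwise frances,
#     castella, portugues or alemany)
# $2: file extension, "htm" unless given
chapter_urls() {
  lang_dir=$1
  ext=${2:-htm}
  base=http://www.intercat.gencat.es/guia
  i=1
  while [ "$i" -le 21 ]; do
    # ${lang_dir:+$lang_dir/} expands to "dir/" only when a directory is given
    printf '%s/%scapitol%s.%s\n' "$base" "${lang_dir:+$lang_dir/}" "$i" "$ext"
    i=$((i + 1))
  done
}
```

For example 'chapter_urls frances html' prints the 21 French / Catalan URLs with the
'.html' ending.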
There are sound files for both the English and the Catalan words, but there don't
appear to be sound files for the French words.
URLs for the wordlists on the intercat site:
/guia/cat-ang.html /guia/ang-cat.html These contain about 1500 words each (presumably the
same words).
VARIOUS ATTEMPTS TO GET CLEAN WORD DATA
This section documents attempts to download the word-pairs from the 'intercat' site
in such a way that the structure of the data is preserved. The two main difficulties
were preserving the 'table' structure of the word/ translation pairs and avoiding
mangling the 'special' (accented) characters. Also, some of the HTML in the intercat
site is dodgy and had to be filtered. The result of these efforts is
the script 'get-intercat-data.sh'.
for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/capitol${ii}.htm; done
The line above does the job absolutely spiffingly. Well done links. There appear to be
approximately 4400 words and phrases on this site, which is very very good. Some of
these are duplicates, but after doing a 'sort' and 'uniq' there are still
4200 entries, which is very good.
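The duplicate check can be done in one pass. The function below is a sketch
('count_entries' is a made-up name) which expects one entry per line on standard
input. It compares lines exactly, so two entries that differ only in spacing count as
different, just as they would with 'sort' and 'uniq'.

```shell
# count_entries: report how many non-empty lines were read and how many of
# them are distinct.
count_entries() {
  awk 'NF { total++; if (!seen[$0]++) distinct++ }
       END { printf "total=%d distinct=%d\n", total, distinct }'
}
```

The 'distinct' figure is the same number that 'sort | uniq | wc -l' would give on
input with no blank lines.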
To get all the French / Catalan pages, we should be able to use
for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/frances/capitol${ii}.htm; done
And to get all the Spanish (castellano) / Catalan pages, we should be able to use
for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/castella/capitol${ii}.htm; done
And for Portuguese and German
for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/portugues/capitol${ii}.htm; done
for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/alemany/capitol${ii}.htm; done
Each of these command lines takes some time to complete.
These lines are now obsoleted by the new 'html2text' line.
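For comparison, here is a sketch of what an 'html2text' version of these loops might
look like. It is not the contents of 'get-intercat-data.sh'. It assumes that 'curl' is
installed to fetch the pages (these notes do not say how the pages were fetched), the
name 'dump_pages' is made up, and the html2text options are the ones used elsewhere in
these notes.

```shell
# dump_pages: read one URL per line on stdin, fetch each page and print it
# as plain text on stdout.
dump_pages() {
  while IFS= read -r url; do
    # -f: fail on HTTP errors, -sS: no progress meter but still show errors
    curl -fsS "$url" | html2text -width 130 -nobs
  done
}

# Possible use (not run here because it needs the network):
#   for ii in $(seq 21); do
#     echo "http://www.intercat.gencat.es/guia/frances/capitol${ii}.htm"
#   done | dump_pages > frances.txt
```

This sketch does no clean-up of the HTML before it reaches html2text.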
Some transformations are now necessary. These cannot be done with 'tr', because
the source character strings are more than one letter long (eg the two characters
'c,' standing for 'ç'), so something like 'sed' is needed.
The first character in the list below looks like a 'dot' or period ASCII character but it
isn't. It is the separator used in the Catalan double l (l·l). Unfortunately the links
browser destroys the double dot above a vowel character (as in ï and ü). It outputs
simply the vowel character itself.
It may be possible to recover this information using a Catalan (monolingual) wordlist which
contains all the correct accents.
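Here is a sketch of how such a recovery could work. I have not tried it on the real
data. 'restore_accents' is a made-up name, the wordlist is assumed to be a UTF-8 file
with one correctly accented word per line, and the input is assumed to be one word per
line as well. A word is only restored when exactly one wordlist entry matches it once
the accents are stripped, otherwise it is left alone.

```shell
# restore_accents: $1 is the accented wordlist, stdin is the mangled word list.
restore_accents() {
  awk '
    # strip: remove the accents that appear in these notes from one word
    function strip(w) {
      gsub(/à/, "a", w); gsub(/ç/, "c", w)
      gsub(/è/, "e", w); gsub(/é/, "e", w); gsub(/É/, "E", w)
      gsub(/í/, "i", w); gsub(/ï/, "i", w)
      gsub(/ò/, "o", w); gsub(/ó/, "o", w)
      gsub(/ú/, "u", w); gsub(/ü/, "u", w); gsub(/Ú/, "U", w)
      return w
    }
    # first file: build the table stripped form -> accented form
    NR == FNR {
      key = strip($0)
      if (!(key in form)) form[key] = $0
      else if (form[key] != $0) ambiguous[key] = 1
      next
    }
    # stdin: replace a word only when the table has a single candidate
    {
      if (($0 in form) && !($0 in ambiguous)) print form[$0]
      else print $0
    }
  ' "$1" -
}
```

Whole phrases would first have to be split into words, which this sketch does not do.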
Here are some of the special characters.
· É Ú à ç è é í ï ò ó ú ü
The lines below are not working because the bash shell does not appear to like
'special' characters. When the lines below are cut and pasted into a bash command
line, the special characters just disappear. The answer to all this is possibly
to do some kind of pre-processing on the HTML files so that when 'links' dumps them
it doesn't mangle the special characters. I think that this is easier than trying to
do some kind of reverse translation.
sed 's/`a/à/g'
sed 's/c,/ç/g'
Here are some HTML entities that were contained in the Portuguese / Catalan pages. These
should be transformed into something else before 'links' has a chance to work its
black magic on them.
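&aacute;  (á)
&acirc;   (â)
&agrave;  (à)
&atilde;  (ã)
&ccedil;  (ç)
&eacute;  (é)
&ecirc;   (ê)
&egrave;  (è)
&iacute;  (í)
&iuml;    (ï)
&oacute;  (ó)
&ocirc;   (ô)
&ograve;  (ò)
&otilde;  (õ)
&uacute;  (ú)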
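A sketch of that pre-processing step follows. It turns the named entities for the
characters listed above into the characters themselves. 'decode_entities' is a made-up
name, and the output is assumed to be wanted in UTF-8 (this file would need to be saved
as UTF-8 too). Any other entity is passed through untouched.

```shell
# decode_entities: replace the named HTML entities listed in these notes
# with the characters they stand for. Reads stdin, writes stdout.
decode_entities() {
  sed -e 's/&aacute;/á/g' -e 's/&acirc;/â/g' -e 's/&agrave;/à/g' \
      -e 's/&atilde;/ã/g' -e 's/&ccedil;/ç/g' -e 's/&eacute;/é/g' \
      -e 's/&ecirc;/ê/g'  -e 's/&egrave;/è/g' -e 's/&iacute;/í/g' \
      -e 's/&iuml;/ï/g'   -e 's/&oacute;/ó/g' -e 's/&ocirc;/ô/g'  \
      -e 's/&ograve;/ò/g' -e 's/&otilde;/õ/g' -e 's/&uacute;/ú/g'
}
```

It would sit between the download and the dump, for example
'decode_entities < junk.html > junk-clean.html'.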
The answer to all this hoo-haa is almost certainly to use the really nice
'html2text' unix program, which doesn't mangle special characters and which also
represents tables nicely. html2text seems to offer all sorts of nice formatting
options via its 'rc' file. Right, onward.
To stop html2text underlining hyperlinks in its output you have to put a line
like A.attributes.external_link=NONE in the file /etc/html2textrc
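If editing /etc/html2textrc is not convenient, the same line could go in a private
'rc' file instead. This is a sketch only: 'write_rc' is a made-up name, and it assumes
that the installed html2text accepts the '-rcfile' option, which I have not checked on
every version.

```shell
# write_rc: create an html2text RC file at path $1 containing the setting
# that turns off underlining of external links.
write_rc() {
  printf '%s\n' 'A.attributes.external_link=NONE' > "$1"
}

# Possible use (not run here):
#   write_rc "$HOME/intercat.html2textrc"
#   html2text -rcfile "$HOME/intercat.html2textrc" -width 130 -nobs junk.html
```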
This works nicely, thank you. But one more problem: there is no simple way to know
where one language's phrase ends and where the other begins. In the 'links' dump
output there were two spaces between them, rather than one. This too should be customizable.
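When the gap between the two phrases really is two or more spaces, as in the 'links'
dump, the pairs can be pulled apart as below. This is a sketch with a made-up name
('split_pairs'). It assumes one pair per line and that no phrase contains a double
space of its own. Lines which do not split into exactly two parts are dropped.

```shell
# split_pairs: turn "phrase  translation" lines into "phrase<TAB>translation".
split_pairs() {
  # trim leading and trailing spaces first so they do not create empty fields
  sed 's/^ *//; s/ *$//' |
    awk -F'  +' 'NF == 2 { print $1 "\t" $2 }'
}
```

The field separator is a regular expression meaning one space followed by one or more
spaces, so single spaces inside a phrase are left alone.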
Some of the files from the gencat site contain doubled double quotation marks,
like this "". This causes html2text to 'swallow' large portions of the file, whereas
lynx tolerates this sort of HTML mistake.
The line below solves a number of these problems but not all. Adding the ':' doesn't really
work properly; see below for a better solution.
cat junk.html | sed "s/\"\"/\"/g;s/<\/font><\/td>/:<\/font><\/td>/g;s/<br>/:<br>/g" | \
html2text -width 130 -nobs
A Solution:
The line below cleverly puts an extra table cell in the word-pair table, which makes
it easier to find the border between the two languages.
cat junk.html | sed "s/\"\"/\"/g;s/<\/td>/<\td>