= Technical Details for the Generalitat de Catalunya Site

INTRODUCTION

This file deals with the difficulties encountered, and the solutions found, when attempting to extract the various 'data-sets' from the Generalitat de Catalunya internet site:

  http://www.intercat.gencat.es/

This site is designed to help foreign students learn the Catalan language. It contains a large number (about 4000) of sound files of Catalan words and phrases being spoken. The sound files are in the RealMedia (RealPlayer) format. The site also contains the textual equivalents of the sounds being spoken, and translations into 5 European languages.

From listening to a relatively small percentage of the sound files, I would say that they are not of the highest quality, although they are probably usable. For example, in a number of cases I was not able to understand what the person speaking English was saying, despite this being my native language. (The site also contains sound files of English words and phrases.)

QUICK CONCLUSIONS

For extracting the data from the web site the unix program 'html2text' was used. At first the text browser 'Lynx' was used, but this destroyed the HTML tables which the web site uses to pair each sound file's 'textual equivalent' with its translation into a European language. That is, the site has a two column table with the Catalan phrase and the English/French/etc. phrase in side by side HTML table cells. Then the text browser 'links' was used, which does support HTML tables but mangles 'special' European characters, such as accented characters. So 'html2text' was used, which does everything well.

TECHNICAL STUFF

  http://www.intercat.gencat.es/guia/capitol1.htm

This is an example of an index page for the words and phrases. The page contains a series of hyperlinks, each of which is linked to a sound file. The hyperlink texts are pairs of equivalent phrases in Catalan and English.
(or other translation languages). The other language pairs available are French, German, Portuguese and Spanish, all paired with Catalan. There appear to be 21 index files (1-21), which are all named as in the command lines below. This is the same for the Catalan/French pages, except that the French pages appear to be named

  http://www.intercat.gencat.es/guia/frances/capitol[1-21].html

where the number range in brackets is not typed literally but represents a series of pages. There are sound files for both the English and the Catalan words, but there don't appear to be sound files for the French words.

URLs for the word lists on the intercat site:

  /guia/cat-ang.html
  /guia/ang-cat.html

These contain about 1500 words each (presumably the same words).

VARIOUS ATTEMPTS TO GET CLEAN WORD DATA

This section documents attempts to download the word pairs from the 'intercat' site in such a way that the structure of the data is preserved. The two main difficulties were preserving the 'table' structure of the word/translation pairs and avoiding mangling the 'special' (accented) characters. Also, some of the HTML on the intercat site is dodgy, and this HTML had to be filtered. The result of these efforts is the script 'get-intercat-data.sh'.

-->>
  for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/capitol${ii}.htm; done
--<<

The line above does the job absolutely spiffingly. Well done links. There appear to be approximately 4400 words and phrases on this site, which is very very good, but I do not know how many duplications there may be. After doing a 'sort' and 'uniq' there are still 4200 entries, which is very good.
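
As a minimal sketch of the 'sort' and 'uniq' check mentioned above, the lines below count the total and the unique entries in a phrase list. The sample phrases and the temporary file are invented for illustration; the real input would be the output of the 'links -dump' loop.

```shell
# Hypothetical sample data standing in for the dumped phrase list.
phrases=$(mktemp)
printf '%s\n' \
  'Bon dia  Good morning' \
  'Bona nit  Good night' \
  'Bon dia  Good morning' \
  'Adéu  Goodbye' > "$phrases"

# Count all lines, then count the lines left after removing duplicates.
total=$(wc -l < "$phrases" | tr -d ' ')
unique=$(sort "$phrases" | uniq | wc -l | tr -d ' ')
echo "total=$total unique=$unique duplicates=$((total - unique))"

# Show only the entries that occur more than once.
sort "$phrases" | uniq -d
```

The 'uniq -d' line is handy for eyeballing which phrases are repeated across index pages before deciding whether the duplicates matter.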
To get all the French/Catalan pages, we should be able to use

-->>
  for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/frances/capitol${ii}.htm; done
--<<

And to get all the Spanish (castellano)/Catalan pages, we should be able to use

-->>
  for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/castella/capitol${ii}.htm; done
--<<

And for Portuguese and German

-->>
  for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/portugues/capitol${ii}.htm; done
  for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/alemany/capitol${ii}.htm; done
--<<

Each of these command lines takes some time to complete. These lines are now made obsolete by the new 'html2text' line.

Some transformations are now necessary. Can these be done with 'tr'? No, because the source character strings are more than one letter long. E.g. the character '·' looks like a 'dot' or period ASCII character, but it isn't. It is the Catalan double l separator (as in 'l·l'). Unfortunately the links browser destroys the double dot above a vowel character; it translates this as simply the vowel character itself. It may be possible to recover this information using a Catalan (monolingual) wordlist which contains all the correct accents.

Here are some of the special characters:

  · É Ú à ç è é í ï ò ó ú ü

The lines below are not working because the bash shell does not appear to like 'special' characters. When the lines below are cut and pasted into a bash command line, the special characters just disappear. (They are written here with single quotes, because a backquote inside double quotes starts a command substitution in bash.) The answer to all this is possibly to do some kind of pre-processing on the HTML files so that when 'links' dumps them it doesn't mangle the special characters. I think that this is easier than trying to do some kind of reverse translation.

-->>
  sed 's/`a/à/g'
  sed 's/c,/ç/g'
--<<

Here are some HTML entities that were contained in the Portuguese/Catalan pages. These should be transformed into something else before 'links' has a chance to work its black magic on them.

  &aacute; &acirc; &agrave; &atilde; &ccedil; &eacute; &ecirc; &egrave; &iacute; &iuml; &nbsp; &oacute; &ocirc; &ograve; &otilde; &uacute;

The answer to all this hoo-haa is almost certainly to use the really nice 'html2text' unix program, which doesn't mangle special characters and which also represents tables nicely. html2text seems to offer all sorts of nice formatting options via the 'rc' file. Right, onwards.

To get html2text not to underline hyperlinks in its output you have to put a line like

  A.attributes.external_link=NONE

in the file /etc/html2textrc. This works nicely, thank you.

But one more problem: there is no simple way to know where one language phrase ends and where the other begins. In the 'links' dump output there were two spaces, rather than one, between them. This too should be customizable.

Some of the files from the gencat site contain reduplicated double quotation marks like this "". This causes html2text to 'swallow' large portions of the file, whereas lynx understands these sorts of HTML mistakes.

The line below solves a number of these problems but not all. Adding the ':' doesn't really work properly; see below for a better solution.

-->>
  cat junk.html | sed "s/\"\"/\"/g;s/<\/font><\/td>/:<\/font><\/td>/g;s/<br>/:<br>/g" | \
    html2text -width 130 -nobs
--<<

A Solution: The line below cleverly puts an extra table cell in the word pair table, which allows us to more easily separate the two languages.

-->>
  cat junk.html | sed "s/\"\"/\"/g;s/<\/td>/<\td>\ \ <\/td>/g" | html2text -nobs -width 140 | tr '\222' "'" | tr -d '\205'
--<<

This line solves all the problems, including the fact that the 'intercat' site seems to use a non-standard character for the "'" single quote character. This appears to be the character with hex code 92. If I can translate that into octal notation then I can use 'tr' to get rid of it, hopefully. (Hex 92 is octal 222.) We also have to get rid of hex code 85 (octal 205) because it is not doing anything productive as far as I can see. The command fragment

  | tr '\222' "'"

does the trick.

So the new lines to get a 'cleaner' version of the data are as follows (Portuguese). The line below doesn't actually work. The standard output seems to get itself in a muddle and doesn't know if it is reading or writing. See the script 'get-intercat-data.sh' for a working version.

-->>
  for ii in $(seq 21); do
    wget -O- http://www.intercat.gencat.es/guia/portugues/capitol${ii}.htm | \
      sed "s/\"\"/\"/g;s/<\/td>/<\td>\ \ <\/td>/g" | \
      html2text -nobs -width 140 | tr '\222' "'" | tr -d '\205';
  done
--<<

The line below turns the clean output into a kind of somewhat dodgy XML.

-->>
  cat wordlist-en-ca.txt | sed -e "s/[[:space:]]*$//g" -e "s/^\([[:space:]]\|\-\)*\(.*\) \{7\}\(.*\)$/<phrase-pair><english>\2<\/english><catalan>\3<\/catalan><\/phrase-pair>/g" -e "s/[ ]*<\/eng/<\/eng/g" > wordlist-en-ca.xml
--<<

And the line below creates a 'bar delimited file'. This is the same as comma delimited but using the "|" character, in the hope that the bar is not used very much within text files.

-->>
  cat wordlist-po-ca.txt | sed -e "s/[[:space:]]*$//g" -e "s/^\([[:space:]]\|\-\)*\(.*\) \{7\}\(.*\)$/\2|\3/g" -e "s/[ ]*|/|/g" > wordlist-po-ca.bsv
--<<

These lines take approximately 15 seconds to complete on my laptop.

The following line attempts to extract all the 'one word' entries in the English/Catalan phrase list.
This includes phrases starting with 'the' and 'a' and ending with (to).

-->>
  cat wordlist-en-ca.bsv | sed 's/^[ ]*\(a\|the\)[ ]\+//g;/^[^ ]\+|/!d' | sort | uniq | wc -l
--<<

A line to reformat the English/Catalan vocab list as 'bsv' or bar delimited text.

-->>
  cat vocab-en-ca.txt | sed -e "s/[[:space:]]*$//g" -e "s/^\([[:space:]]\|\-\)*\(.*\) \{3\}\(.*\)$/\2|\3/g" -e "s/[ ]*|/|/g" > vocab-en-ca.bsv
--<<

After extracting the one-word entries from the phrase list and combining them with the vocab list we get a list of about 2200 words.

A line to reformat the combined lists to remove any duplicated stuff.

-->>
  cat one-word-list-en-ca.bsv vocab-en-ca.bsv | tr '[:upper:]' '[:lower:]' | tr -d '?!' | sed "s/|[ ]*\-[ ]*/|/g;s/[ ]*,[ ]*|//g;s/[ ]*,[ ]*$//g" | sort | uniq | more
--<<

---------------

A line to extract all the 'special' characters from a text file.

-->>
  sed "s/[[:punct:]]/./g; s/[a-zA-Z0-9 ]/./g" gencat.txt | \
    sed "/^[[:space:]]*$/d" | tr "." '\n' | sort | uniq | more
--<<

A line to create a unique list of the HTML entities (&tilde; sort of stuff) that are contained within an HTML document.

-->>
  wget -O- http://www.intercat.gencat.es/guia/portugues/capitol4.htm | \
    sed "s/^.*\(\&[a-z]\{1,13\};\).*$/\1/g" | sed "/^[^\&]/d" | sort | uniq | less
--<<

USING LYNX NOT LINKS

It will probably be necessary to use the text-only browser 'lynx' in order to get the URL references for the sound files, since the files do not seem to be named anything particularly logical. By using the 'References' section in the output of the lynx -dump option it should be possible to link the URL references with their associated text phrases. Then, hopefully, it should also be possible to combine this data with the output of the links -dump option to create a truly useful sound dictionary.

-->>
  for ii in $(seq 21); do
    lynx -dump http://www.intercat.gencat.es/guia/capitol${ii}.htm |\
      sed "s/^ *$/=/g; s/^[[:space:]]*//g";
  done > gencat.txt
--<<

A command line to get all the gencat pages.
This doesn't work particularly well because the gencat hyperlinks are not always in 'translation pairs'. It would probably be better to use 'links' rather than 'lynx' because 'links' supports tables (but can it do a 'dump'?). Yes, links can do the dump, and seems to deal with the gencat tables very well. But by default it does not number the hyperlinks and it does not provide the URLs referenced by the hyperlinks. Also it does some funny things with the 'special' characters (that is, European letters). This should be recoverable.

A line to get the URL and strip unwanted stuff.

-->>
  lynx -dump -width=150 URL | expand | sed "s/^[ ]*//g; s/[ ]*$//g; s/^-[ ]*//g; /^[ ]*$/d" > junk.txt
--<<

A line to make a sed script out of the References section of the lynx dump output. This script can then be used to bind phrases to URLs, hopefully.

-->>
  cat junk.txt | sed "1,/^Reference/d; s#^\([0-9]\{1,6\}\)\.[ ]*\(.*\)#s@\\\[\1\\\]@\\\[\2\\\]@g;#g" > temp.sed
--<<

There are a few tricks here. A different substitution delimiter is used ('@' instead of '/') so that we do not have to 'escape' all the forward slashes in the referenced URLs. Also the characters '[' and ']' both need to be escaped because otherwise they will define a character class or set instead of a literal string, which is what we want in this situation.

-->>
  sed -f temp.sed junk.txt | sed -e "/^References/,$ d" -e "s/\([^ ]\+[ ]*\)\[/\1*[/g" | tr "*" '\n' | more
--<<

This line then uses the generated sed script to put the references next to the text which they relate to. It uses the somewhat dodgy means of inserting newlines into the file via the character '*', hoping that that character is not used (much) in the file. This is not ideal, but I can't seem to find a version of sed that allows you to put newlines in the right-hand side, although apparently they are available. See the script 'get-intercat-references.sh', which does all these steps at once but better.
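
To make the 'sed script that writes a sed script' idea concrete, here is a small self-contained sketch. It is not the contents of 'get-intercat-references.sh'; the sample dump and the example.org URLs are invented, and the quoting is simplified by using single quotes and escaping only the '[' (an unescaped ']' is already literal outside a bracket expression).

```shell
workdir=$(mktemp -d)

# Hypothetical lynx-style dump: numbered links, then a References section.
cat > "$workdir/junk.txt" <<'EOF'
[1]Bon dia [2]Good morning
[3]Bona nit [4]Good night
References
1. http://example.org/audio/c01.ra
2. http://example.org/audio/e01.ra
3. http://example.org/audio/c02.ra
4. http://example.org/audio/e02.ra
EOF

# Turn every "N. URL" reference into a sed command of the form
#   s@\[N]@[URL]@g
# Single quotes keep the shell from eating the backslashes.
sed -n '1,/^References/d; s#^\([0-9]\{1,6\}\)\.[ ]*\(.*\)#s@\\[\1]@[\2]@g#p' \
  "$workdir/junk.txt" > "$workdir/temp.sed"

# Drop the References section, then let the generated script
# replace each [N] marker with its URL.
result=$(sed '/^References/,$d' "$workdir/junk.txt" | sed -f "$workdir/temp.sed")
printf '%s\n' "$result"
```

A caveat on this sketch: a URL containing '@' or '&' would break the generated commands, since '@' is the delimiter and '&' is special in a sed replacement, so real data may need those characters escaped first.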

COMBINING THE REFERENCES AND THE TRANSLATIONS

-->>
  for ii in $(cat intercat-ref-en-ca.txt | grep -E 'guia/audio[0-9]+/s?l?c' | tr ' ' '#'); do word=$(echo $ii | sed "s/^.*\]//g" | tr '#' ' '); ref=$(echo $ii | sed "s/.*\[\(.*\)\].*/\1/g"); sed -n "/$word/ s@^\(.*\)|\(.*\)@$ref||\2||\1||@gp" wordlist-en-ca.bsv | head -1; done
--<<

The line above is the beginnings of a script that will combine the sound file references and translation pairs. It is probably going to be quite slow. Also, this whole technique is not really necessary. I am only combining these because the javascript tutoring script requires the sound URLs to be in the same array as the translations.

Here is the same thing written out much better:

-->>
  for ii in $(cat intercat-ref-en-ca.txt | \
              grep -E 'guia/audio[0-9]+/s?l?c' | \
              tr ' ' '#'); do
    word=$(echo $ii | sed "s/^.*\]//g" | tr '#' ' ');
    ref=$(echo $ii | sed "s/.*\[\(.*\)\].*/\1/g");
    #echo "CURRENT WORD=$word";
    sed -n "/$word/ s@^\(.*\)|\(.*\)@$ref||\2||\1||@gp" wordlist-en-ca.bsv | head -1;
  done
--<<
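
Because the loop above starts one sed process per reference line, a single-pass join may be quicker. The sketch below is an alternative, not the author's method: it swaps in awk and uses an exact-match lookup on the Catalan phrase instead of the regex match the loop uses. The file names and sample data are invented, and the '[URL]phrase' and 'english|catalan' layouts are assumed from the commands above.

```shell
workdir=$(mktemp -d)

# Hypothetical reference lines in the "[URL]phrase" shape.
cat > "$workdir/refs.txt" <<'EOF'
[/guia/audio1/c01.ra]Bon dia
[/guia/audio1/c02.ra]Bona nit
EOF

# Hypothetical bar-delimited word list: english|catalan
cat > "$workdir/words.bsv" <<'EOF'
Good morning|Bon dia
Good night|Bona nit
EOF

combined=$(awk -F'|' '
    # First file: remember the English phrase for each Catalan phrase.
    NR == FNR { english[$2] = $1; next }
    # Second file: split "[URL]phrase" and print URL||catalan||english||
    {
        close_pos = index($0, "]")
        if (substr($0, 1, 1) != "[" || close_pos == 0) next
        ref  = substr($0, 2, close_pos - 2)
        word = substr($0, close_pos + 1)
        if (word in english)
            print ref "||" word "||" english[word] "||"
    }
' "$workdir/words.bsv" "$workdir/refs.txt")
printf '%s\n' "$combined"
```

The exact-match lookup means phrases that differ in case or trailing punctuation will not be paired, so the inputs may need the same normalising 'tr' and 'sed' steps used earlier before this is run on real data.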