= Technical Details for the Generalitat de Catalunya Site

 INTRODUCTION

   This file deals with difficulties and solutions encountered when attempting to
   extract the various 'data-sets' from the Generalitat de Catalunya internet site:
   http://www.intercat.gencat.es/ This site is designed to help foreign students
   learn the Catalan language. The site contains a large number (about 4000) of sound
   files of Catalan words and phrases being spoken. The sound files are in the
   RealMedia (RealPlayer) format. The site also contains the textual equivalents of
   the sounds being spoken and translations into 5 European languages.

   From listening to a relatively small percentage of the sound files I would say that
   they are not of the highest quality, although they are probably usable. For example,
   in a number of cases I was not able to understand what the person speaking English was
   saying, despite this being my native language. (The site also contains sound files of
   English words and phrases.)

 QUICK CONCLUSIONS
   For extracting the data from the web site the Unix program 'html2text' was used.
   At first the text browser 'Lynx' was used, but this destroyed the HTML tables which
   the web site uses to pair each sound file's 'textual equivalent' with its
   translation into a European language. That is, the site has a two column table with
   the Catalan phrase and the English/ French/ etc phrase in side by side HTML table cells.

   Then the text browser 'links' was used. It does support HTML tables, but it
   mangles 'special' European characters such as 'accented' characters.
   So 'html2text' was used, which handles both tables and accented characters well.

   
 TECHNICAL STUFF


    http://www.intercat.gencat.es/guia/capitol1.htm
      This is an example of an index page for the words and phrases. The page
      contains a series of hyperlinks each of which is linked to a sound file.
      The hyperlink texts are pairs of equivalent phrases in Catalan and
      English (or another translation language).
      
      The other language pairs available are French, German, Portuguese and Spanish,
      all paired with Catalan.

      There appear to be 21 index files (1-21) which are all named as in the command
      lines below.

      This is the same for the Catalan/ French pages except that the French pages
      appear to be named http://www.intercat.gencat.es/guia/frances/capitol[1-21].html
      where the number range in brackets is not typed literally but represents a
      series of pages.

      There are sound files for both the English and the Catalan words but there don't
      appear to be sound files for the French words.
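
      The page naming described above can be sketched as a small shell function.
      This is only an illustration: the function name 'chapter_urls' is my own, the
      directory names (frances, castella, portugues, alemany) are the ones used in the
      command lines further down, and the '.htm' ending is assumed from those command
      lines as well.

```shell
# Print the 21 index-page URLs for one language directory.
# Pass "" for the English pages, which sit directly under /guia/.
# (chapter_urls is a hypothetical helper name, not part of any script on the site.)
chapter_urls() {
  base="http://www.intercat.gencat.es/guia"
  for ii in $(seq 21); do
    if [ -n "$1" ]; then
      echo "$base/$1/capitol${ii}.htm"
    else
      echo "$base/capitol${ii}.htm"
    fi
  done
}

# Example: list the French index pages.
chapter_urls frances
```

      The output can be fed to any of the dump commands one URL at a time.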

  URLs for the wordlists on the intercat site
    /guia/cat-ang.html  /guia/ang-cat.html These contain about 1500 words each (presumably the
    same words).
  
VARIOUS ATTEMPTS TO GET CLEAN WORD DATA
  
  This section documents attempts to download the word-pairs for the 'intercat' site
  in such a way that the structure of the data is preserved. The two main difficulties
  were preserving the 'table' structure of the word/ translation pairs and avoiding
  mangling the 'special' (accented) characters. Also, some of the HTML in the intercat
  site is dodgy and had to be filtered. The result of these efforts is
  the script 'get-intercat-data.sh'.

    for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/capitol${ii}.htm; done 

     The line above does the job absolutely spiffingly. Well done links. There appear to be
     approximately 4400 words and phrases on this site, which is very good. I do not know
     how many duplications there may be, but after doing a 'sort' and 'uniq' there are still
     4200 entries.
     
     To get all the French / Catalan pages, we should be able to use
      for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/frances/capitol${ii}.htm; done

     And to get all the Spanish (castellano) / Catalan pages, we should be able to use
      for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/castella/capitol${ii}.htm; done

     And for Portuguese and German
      for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/portugues/capitol${ii}.htm; done
      for ii in $(seq 21); do links -dump http://www.intercat.gencat.es/guia/alemany/capitol${ii}.htm; done

     Each of these command lines takes some time to complete.
     These lines are now made obsolete by the new 'html2text' line.

     Some transformations are now necessary. These cannot be done with 'tr' because
     the source character strings are more than one character long.

     The first character in the list below looks like a 'dot' or period ASCII character but it
     isn't. It is the Catalan double l separator. Unfortunately the links browser destroys the
     double dot (diaeresis) above a vowel character. It translates this as simply the vowel
     character itself. It may be possible to recover this information using a Catalan
     (monolingual) wordlist which contains all the correct accents.
 
     Here are some of the special characters.
     · É Ú à ç è é í ï ò ó ú ü
     
     The lines below are not working because the bash shell does not appear to like
     'special' characters. When the lines below are cut and pasted into a bash command
     line the special characters just disappear. The answer to all this is possibly
     to do some kind of pre-processing on the HTML files so that when 'links' dumps them
     it doesn't mangle the special characters. I think that this is easier than trying to
     do some kind of reverse translation.

     sed 's/`a/à/g'
     sed 's/c,/ç/g'
     
     Here are some HTML entities that were contained in the Portuguese/ Catalan pages. These
     should be transformed into something else before 'links' has a chance to work its
     black magic on them.

     &aacute;
     &acirc;
     &agrave;
     &atilde;
     &ccedil;
     &eacute;
     &ecirc;
     &egrave;
     &iacute;
     &iuml;
     &nbsp;
     &oacute;
     &ocirc;
     &ograve;
     &otilde;
     &uacute;
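
     The pre-processing idea above could look something like the sketch below, which
     turns exactly the entities listed above into the characters they stand for. The
     function name 'decode_entities' is my own invention, and the characters are written
     in whatever encoding this file is saved in (UTF-8 is assumed here). Whether the
     dumping browser then leaves those characters alone has not been tested.

```shell
# Replace the HTML entities found on the Portuguese pages with literal characters.
# decode_entities is a hypothetical helper; it reads HTML on stdin and writes to stdout.
# Assumption: this script is saved as UTF-8, so the output characters are UTF-8 too.
decode_entities() {
  sed -e 's/&aacute;/á/g' -e 's/&acirc;/â/g' -e 's/&agrave;/à/g' \
      -e 's/&atilde;/ã/g' -e 's/&ccedil;/ç/g' -e 's/&eacute;/é/g' \
      -e 's/&ecirc;/ê/g'  -e 's/&egrave;/è/g' -e 's/&iacute;/í/g' \
      -e 's/&iuml;/ï/g'   -e 's/&oacute;/ó/g' -e 's/&ocirc;/ô/g'  \
      -e 's/&ograve;/ò/g' -e 's/&otilde;/õ/g' -e 's/&uacute;/ú/g' \
      -e 's/&nbsp;/ /g'
}

# Example use on a saved page:
#   decode_entities < junk.html > junk-decoded.html
```

     The non-breaking space is turned into an ordinary space here, which is a choice
     made for readability rather than something the site requires.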
     
     The answer to all this hoo-haa is almost certainly to use the really nice
     'html2text' Unix program, which doesn't mangle special characters and which also
     represents tables nicely. html2text seems to offer all sorts of nice formatting
     options via the 'rc' file. bueno, adelante.

     To get html2text not to underline hyperlinks in its output you have to put a line
     like A.attributes.external_link=NONE in the file /etc/html2textrc
     This works nicely, thank you. But one more problem: there is no simple way to know
     where one language phrase ends and where the other begins. In the 'links' dump
     output there were two spaces, rather than one, between the two phrases. This too
     should be customizable.

     Some of the files from the gencat site contain doubled double quotation marks
     like this "". This causes html2text to 'swallow' large portions of the file, whereas
     lynx copes with this sort of HTML mistake.

     The line below solves a number of these problems but not all. Adding the ':' doesn't really
     work properly; see below for a better solution.
      cat junk.html | sed "s/\"\"/\"/g;s/<\/font><\/td>/:<\/font><\/td>/g;s/<br>/:<br>/g" | \
       html2text -width 130 -nobs

     A Solution:
     The line below puts an extra table cell after each cell in the word pair table, which
     makes it easier to find the border between the two languages.
       
     cat junk.html | sed "s/\"\"/\"/g;s/<\/td>/<\/td><td>\&nbsp;\&nbsp;<\/td>/g" | html2text -nobs -width 140 | tr '\222' "'" | tr -d '\205' 
       This line solves all the problems, including the fact that the 'intercat' site seems to
       use a non-standard character for the "'" single quote character. This appears to be the
       character with hex code 92, which is 222 in the octal notation that 'tr' expects. ojala.
       We also have to get rid of hex code 85 (octal 205) because it's not doing
       anything productive as far as I can see.
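
     The hex to octal translation needed for 'tr' can be done by the shell's own
     'printf'. A minimal sketch follows; the function name 'hex_to_octal' is just an
     illustration.

```shell
# Convert a hex byte value (given without the 0x prefix) to the octal form
# that tr accepts in escapes such as '\222'.
# hex_to_octal is a hypothetical helper name.
hex_to_octal() {
  printf '%o\n' "0x$1"
}

hex_to_octal 92   # prints 222
hex_to_octal 85   # prints 205
```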

     The command fragment | tr '\222' "'"       does the trick.
     So the new lines to get a 'cleaner' version of the data are as follows (Portuguese).
     
     The line below doesn't actually work. The standard out seems to get itself in a muddle and doesn't
     know if it is reading or writing. See the script 'get-intercat-data.sh' for a working version.

      for ii in $(seq 21); do wget -O- http://www.intercat.gencat.es/guia/portugues/capitol${ii}.htm | \
        sed "s/\"\"/\"/g;s/<\/td>/<\/td><td>\&nbsp;\&nbsp;<\/td>/g" | \
        html2text -nobs -width 140 | tr '\222' "'" | tr -d '\205'; done 

     The line below turns the clean output into a kind of somewhat dodgy XML
      cat wordlist-en-ca.txt | sed -e "s/[[:space:]]*$//g" -e "s/^\([[:space:]]\|\-\)*\(.*\) \{7\}\(.*\)$/<phrase-pair><english>\2<\/english><catalan>\3<\/catalan><\/phrase-pair>/g" -e "s/[ ]*<\/eng/<\/eng/g" > wordlist-en-ca.xml

     And the line below creates a 'bar delimited file'. This is the same as comma delimited but using
     the "|" character, in the hope that the bar is not used very much within text files.

     cat wordlist-po-ca.txt | sed -e "s/[[:space:]]*$//g" -e "s/^\([[:space:]]\|\-\)*\(.*\) \{7\}\(.*\)$/\2|\3/g" -e "s/[ ]*|/|/g" > wordlist-po-ca.bsv
     These lines take approximately 15 seconds to complete on my laptop
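
     The column-splitting step above is a bit easier to read with extended regular
     expressions ('sed -E'). The sketch below is meant to behave like the line above for
     the 7-space gap; the function name 'to_bsv' is made up for the example, and it has
     only been reasoned through on small inputs, not run over the real word lists.

```shell
# Turn "  - english phrase       catalan phrase" (columns separated by at least
# 7 spaces) into "english phrase|catalan phrase".
# to_bsv is a hypothetical helper; it reads stdin and writes stdout.
to_bsv() {
  sed -E -e 's/[[:space:]]+$//' \
         -e 's/^([[:space:]]|-)*(.*) {7}(.*)$/\2|\3/' \
         -e 's/ *\|/|/g'
}

# Example use:
#   to_bsv < wordlist-po-ca.txt > wordlist-po-ca.bsv
```

     The first expression strips trailing white space, the second splits at the last run
     of 7 spaces, and the third removes any spaces left over in front of the bar.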

      The following line attempts to extract all the 'one word' entries in the English/Catalan
      phrase list. This includes phrases starting with 'the' and 'a' and ending with (to)

      cat wordlist-en-ca.bsv | sed 's/^[ ]*\(a\|the\)[ ]\+//g;/^[^ ]\+|/!d' | sort | uniq | wc -l
      
      A line to reformat the English/ Catalan vocab list as 'bsv' or bar delimited text.
       cat vocab-en-ca.txt | sed -e "s/[[:space:]]*$//g" -e "s/^\([[:space:]]\|\-\)*\(.*\) \{3\}\(.*\)$/\2|\3/g" -e "s/[ ]*|/|/g" > vocab-en-ca.bsv
      
     After extracting the one-word entries from the phrase list and combining them with the
     vocab list we get a list of about 2200 words.

     A line to reformat the combined lists to remove any duplicated stuff
      cat one-word-list-en-ca.bsv vocab-en-ca.bsv | tr '[:upper:]' '[:lower:]' | tr -d '?!' | sed "s/|[ ]*\-[ ]*/|/g;s/[ ]*,[ ]*|//g;s/[ ]*,[ ]*$//g" | sort | uniq | more


     ---------------  
     A line to extract all the 'special' characters from a text file.
   -->>  
     sed "s/[[:punct:]]/./g; s/[a-zA-Z0-9 ]/./g" gencat.txt | \
      sed "/^[[:space:]]*$/d" | tr "." '\n' | sort | uniq | more
   --<<

     A line to create a unique list of the html entities (&tilde; sort of stuff) that
     are contained within an HTML document.
   -->>
      wget -O- http://www.intercat.gencat.es/guia/portugues/capitol4.htm | \
        sed "s/^.*\(\&[a-z]\{1,13\};\).*$/\1/g" | sed "/^[^\&]/d" | sort | uniq | less 
   --<<
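
     One limitation of the sed line above is that it keeps only one entity per input
     line (the last one, because the leading '.*' is greedy). A possible alternative,
     assuming a grep that supports the '-o' option, is sketched below. The function name
     'list_entities' is hypothetical.

```shell
# Print each distinct HTML entity (&name;) found on stdin, one per line, sorted.
# list_entities is a hypothetical helper name.
list_entities() {
  grep -o '&[a-z]\{1,13\};' | sort -u
}

# Example use:
#   wget -O- http://www.intercat.gencat.es/guia/portugues/capitol4.htm | list_entities
```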
 USING LYNX NOT LINKS

    It will probably be necessary to use the text-only browser 'lynx' in order to get the URL
    references for the sound files, since the files do not seem to be named anything particularly
    logical. By using the 'references section' in the output of the lynx -dump option it should be
    possible to link the URL references with their associated text phrases. Then, hopefully,
    it should also be possible to combine this data with the output of the links -dump
    option to create a truly useful sound dictionary.

    -->>
    for ii in $(seq 21); do lynx -dump http://www.intercat.gencat.es/guia/capitol${ii}.htm |\
      sed "s/^ *$/=/g; s/^[[:space:]]*//g"; done > gencat.txt
    --<<
    
     The line above is a command line to get all the gencat pages. This doesn't work
     particularly well because the gencat hyperlinks are not always in 'translation pairs'.
     It would probably be better to use 'links' rather than 'lynx' because 'links' supports
     tables (but can it do a 'dump'?). Yes, links can do the dump, and seems to deal with the
     gencat tables very well. But by default it does not number the hyperlinks and
     it does not provide the URLs referenced by the hyperlinks. Also it does some funny
     things with the 'special' characters (that is, European letters). This should be
     recoverable.

   A line to get the url and strip unwanted stuff
   -->>
   lynx -dump -width=150 URL | expand | sed "s/^[ ]*//g; s/[ ]*$//g; s/^-[ ]*//g; /^[ ]*$/d" > junk.txt
   --<<
   A line to make a sed script out of the references section of the lynx dump output.
   This script can then be used to bind phrases to URLs, hopefully.
    cat junk.txt |  sed "1,/^Reference/d; s#^\([0-9]\{1,6\}\)\.[ ]*\(.*\)#s@\\\[\1\\\]@\\\[\2\\\]@g;#g" > temp.sed

   There are a few tricks here. A different substitution delimiter is used ('@' instead of '/') so
   that we do not have to 'escape' all the forward slashes in the referenced URLs. Also the
   characters '[' and ']' both need to be escaped because otherwise they would define a character
   class or set instead of a literal string, which is what we want in this situation.
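
   To make the tricks above concrete, here is a small sketch of the same idea written
   with single quotes, which cuts down on the backslashes. The function name
   'refs_to_sed' and the example.invalid URL are made up for the illustration. Only
   the opening '[' is escaped in the generated pattern, because a ']' on its own is
   already literal outside a bracket expression.

```shell
# Read a cleaned-up lynx dump on stdin and write a sed script on stdout that
# replaces each reference number "[N]" with "[URL]".
# refs_to_sed is a hypothetical helper name.
refs_to_sed() {
  sed -n -e '1,/^Reference/d' \
         -e 's#^\([0-9]\{1,6\}\)\.[ ]*\(.*\)#s@\\[\1]@[\2]@g#p'
}

# For a reference line "1. http://example.invalid/a.ra" the generated command is
#   s@\[1]@[http://example.invalid/a.ra]@g
# Example use:
#   refs_to_sed < junk.txt > temp.sed
```

   A URL containing '&' or '@' would still upset the generated script, so this is not
   a complete solution.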
   -->>
    sed -f temp.sed  junk.txt | sed -e "/^References/,$ d" -e "s/\([^ ]\+[ ]*\)\[/\1*[/g" | tr "*" '\n' | more
   --<<
   
   The line above then uses the generated sed script to put the references next to the text
    which they relate to. It uses the somewhat dodgy means of inserting newlines into
    the file by using the character * and then translating it, hoping that that character is
    not used (much) in the file. This is not ideal, but I can't seem to find a version of sed
    that allows you to put newlines in the right-hand side, although apparently they are available.
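
   For what it is worth, the POSIX description of sed does allow a newline in the
   right-hand side, provided it is written as a backslash followed by a real line
   break. A sketch under that assumption, with the made-up function name
   'break_before_refs', would be:

```shell
# Start a new line in front of every "[" that follows some text, without
# going through a placeholder character.
# break_before_refs is a hypothetical helper; it reads stdin and writes stdout.
break_before_refs() {
  sed 's/\([^ ][^ ]*[ ]*\)\[/\1\
[/g'
}

# Example use:
#   sed -f temp.sed junk.txt | sed -e "/^References/,$ d" | break_before_refs
```

   The '[' on the second line of the sed expression has to start in the first column,
   otherwise the leading spaces end up in the output.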

   See the script 'get-intercat-references.sh' which does all these steps at once but better.

COMBINING THE REFERENCES AND THE TRANSLATIONS

  -->>
for ii in $(cat intercat-ref-en-ca.txt | grep -E 'guia/audio[0-9]+/s?l?c' | tr ' ' '#'); do word=$(echo $ii | sed "s/^.*\]//g" | tr '#' ' '); ref=$(echo $ii | sed "s/.*\[\(.*\)\].*/\1/g"); sed -n "/$word/ s@^\(.*\)|\(.*\)@$ref||\2||\1||@gp" wordlist-en-ca.bsv | head -1; done 
 --<<

   The line above is the beginnings of a script that will combine the sound file
   references and translation pairs. It is probably going to be quite slow. Also, this whole
   technique is not really necessary. I am only combining these because the JavaScript tutoring
   script requires the sound URLs to be in the same array as the translations.
   
   Here is the same thing written out much better:

   for ii in $(cat intercat-ref-en-ca.txt | \
      grep -E 'guia/audio[0-9]+/s?l?c' | \
      tr ' ' '#');
   do
     # "$ii" is quoted so the shell does not treat the [ ... ] in the reference as a glob
     word=$(echo "$ii" | sed "s/^.*\]//g" | tr '#' ' '); 
     ref=$(echo "$ii" | sed "s/.*\[\(.*\)\].*/\1/g");
     #echo "CURRENT WORD=$word";
     sed -n "/$word/ s@^\(.*\)|\(.*\)@$ref||\2||\1||@gp" wordlist-en-ca.bsv | head -1;
   done
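
   Since the loop above runs sed once per reference, a single-pass version might be
   faster. The sketch below uses awk and rests on several assumptions: the reference
   file has lines of the form "[URL]phrase", the word list has lines of the form
   "english|catalan", and a phrase must match one side of a word-list line exactly
   (the loop above matches anywhere in the line, so the results can differ). The
   function name 'combine_refs' is made up.

```shell
# Usage: combine_refs REFERENCE_FILE WORDLIST_BSV
# Prints "URL||catalan||english||" for every sound-file reference whose phrase
# is found in the word list. combine_refs is a hypothetical helper name.
combine_refs() {
  awk -F'|' '
    # First file read (the word list): remember the first pairing for each phrase.
    NR == FNR {
      if (!($1 in en)) en[$1] = $2
      if (!($2 in ca)) ca[$2] = $1
      next
    }
    # Second file read (the references): split "[URL]phrase" into its two parts.
    match($0, /\[[^]]*\]/) {
      url = substr($0, RSTART + 1, RLENGTH - 2)
      phrase = substr($0, RSTART + RLENGTH)
      gsub(/^[ \t]+|[ \t]+$/, "", phrase)
      if (url !~ /guia\/audio[0-9]+\/s?l?c/) next
      if (phrase in en)      print url "||" en[phrase] "||" phrase "||"
      else if (phrase in ca) print url "||" phrase "||" ca[phrase] "||"
    }
  ' "$2" "$1"
}

# Example use:
#   combine_refs intercat-ref-en-ca.txt wordlist-en-ca.bsv
```

   The output keeps the same field order as the loop above (reference, Catalan,
   English).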

