LibLyric - Lyrics Library






LibLyric is a tool that downloads song lyrics from the web without relying on any single web site or on the HTML structure of its pages. It searches the web for the lyrics and returns what it judges to be the best match.

How does LibLyric work?

  1. LibLyric contacts one of the following web search engines to get a list of candidate pages (sites) that contain the requested song's lyrics:
    1. http://www.ask.com [default]
    2. http://www.yahoo.com
    3. http://www.google.com
    4. http://www.dogpile.com

  2. After this, all the links are extracted from the returned HTML page and checked for duplicate sites. If several pages come from the same site, only one of them is kept and the rest are discarded.

  3. The remaining pages are then downloaded and stored in the temporary directory /tmp/liblyric/p.PID, where PID is the process ID of the running instance of LibLyric.

  4. These downloaded pages are now stripped of HTML tags. All scripts and comments are removed as well. All <br> variants such as <BR>, <BR/  >, and so on are normalized to a single <br> form, and every <br> tag is then replaced by a newline. Each page is processed in parallel, which reduces the wait time by a fair amount.

  5. After this, we perform an all-to-all, two-way approximate intersection of the downloaded, tag-stripped pages. You can think of the pages as nodes (vertices) in a fully connected asymmetric graph, where the weight on each edge is the size of the intersection of the two pages it connects.

  6. The above operation produces an intermediate file called extents.txt. Each row of this file is a single entry with the following fields:
    Extent Size    Extent Start    Extent End    File Name
    Here, File Name is the name of the file to which the extent belongs. An extent is a block of text that matches approximately in two pages. The intersection of any two pages returns the largest extent found, or nothing if no extent exceeds the internal threshold. This prevents small rogue extents from popping up.

  7. Next, we sort the entries in this intermediate file in descending order by the Extent Size, and remove all entries where the Extent Start is less than 32. This operation produces another file called ordered_exts.txt.

  8. Now, if there are at least 2 entries in ordered_exts.txt, the second one is selected, and Extent Size bytes of text starting from offset Extent Start in the corresponding file are displayed after being passed through some other filters. If there is just one entry, that one is displayed. If there are none, an error saying that no lyrics were found is displayed. The choice of the second entry is purely empirical.
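
The site de-duplication in step 2 can be sketched in Python as follows. This is only an illustration, not LibLyric's own code: the function name dedupe_by_site is made up, a "site" is assumed to mean the host name of the URL with any leading "www." ignored, and the first link seen for each site is the one kept (the description only says that one of them is kept).

```python
from urllib.parse import urlsplit

def dedupe_by_site(links):
    """Keep the first link seen for each host; drop later links from the same host."""
    seen_hosts = set()
    kept = []
    for link in links:
        host = urlsplit(link).netloc.lower()
        # Assumption: www.example.com and example.com count as the same site.
        if host.startswith("www."):
            host = host[4:]
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            kept.append(link)
    return kept

links = [
    "http://www.example.com/lyrics/song-a",
    "http://example.org/song-a.html",
    "http://example.com/other/song-a",
]
unique_links = dedupe_by_site(links)
print(unique_links)
```

Here the third link is discarded because it comes from the same site as the first.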

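The clean-up in step 4 might look roughly like the sketch below. LibLyric itself relies on the UNHTML tool listed under Dependencies; the regular expressions here are a simplified stand-in chosen for illustration, and the function name clean_page is hypothetical.

```python
import re

def clean_page(raw_html):
    """Return the page text with scripts, comments, and tags removed."""
    # Remove script blocks and HTML comments, including their contents.
    text = re.sub(r"(?is)<script\b.*?</script\s*>", "", raw_html)
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    # Treat every <br> variant (<BR>, <br/>, <BR/  >) as a line break.
    text = re.sub(r"(?i)<br\s*/?\s*>", "\n", text)
    # Drop all remaining tags.
    return re.sub(r"(?s)<[^>]+>", "", text)

sample = (
    "<p>Line one<BR/  >Line two<br>Line three</p>"
    "<script>var x = 1;</script><!-- ad -->"
)
cleaned = clean_page(sample)
print(cleaned)
```

The <br> tags are converted before the generic tag removal so that line breaks inside the lyrics survive as newlines.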

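To make steps 5 and 6 more concrete, here is a rough Python sketch of the all-to-all comparison. It deliberately swaps in a simpler technique: difflib's exact longest common block rather than the approximate matching LibLyric performs. MIN_EXTENT_SIZE is a hypothetical value (the real internal threshold is not documented here), the function names are made up, and Extent End is assumed to be Extent Start plus Extent Size.

```python
from difflib import SequenceMatcher
from itertools import permutations

MIN_EXTENT_SIZE = 20  # hypothetical threshold, not LibLyric's actual value

def largest_extent(text_a, text_b):
    """Return (start_in_a, size) of the longest block shared by both texts, or None."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    if match.size < MIN_EXTENT_SIZE:
        return None
    return match.a, match.size

def all_pairs_extents(pages):
    """pages maps file name -> stripped text; returns (size, start, end, file name) rows."""
    rows = []
    # Every ordered pair is compared; the extent is recorded relative to the first page.
    for (name_a, text_a), (_, text_b) in permutations(pages.items(), 2):
        extent = largest_extent(text_a, text_b)
        if extent is not None:
            start, size = extent
            rows.append((size, start, start + size, name_a))
    return rows

lyric = "first line of the song\nsecond line of the song\n"
pages = {
    "a.txt": "MENU|" + lyric + "|FOOTER",
    "b.txt": "header>>" + lyric + "<<copyright",
    "c.txt": "unrelated short page",
}
rows = all_pairs_extents(pages)
for row in rows:
    print(row)
```

Only the two pages that share the lyric text produce rows; the unrelated page falls below the threshold against both of them.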

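The ordering and selection rules in steps 7 and 8 can be sketched as below. The rows are made-up sample data in the (Extent Size, Extent Start, Extent End, File Name) layout described above, and the function names are illustrative only.

```python
MIN_EXTENT_START = 32  # from step 7: entries that start before offset 32 are removed

def order_extents(rows):
    """Drop extents that start too early and sort the rest by size, largest first."""
    kept = [row for row in rows if row[1] >= MIN_EXTENT_START]
    return sorted(kept, key=lambda row: row[0], reverse=True)

def pick_extent(ordered):
    """Return the second entry if there are at least two, the only entry if one, else None."""
    if len(ordered) >= 2:
        return ordered[1]
    if len(ordered) == 1:
        return ordered[0]
    return None  # the caller would report that no lyrics were found

rows = [
    (410, 120, 530, "page1.txt"),
    (395, 64, 459, "page2.txt"),
    (900, 10, 910, "page3.txt"),  # starts before offset 32, so it is dropped
]
ordered = order_extents(rows)
chosen = pick_extent(ordered)
print(chosen)
```

With this sample data the largest extent is discarded by the offset filter, and the second of the two remaining entries is chosen.
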
Dependencies:

  1. CURL Homepage: http://curl.haxx.se/download.html
  2. UNHTML Download page: http://packages.debian.org/unstable/source/unhtml

You can download the files from the two pages above and place them in the folder created when you extract liblyric. For example, if you get a folder named "liblyric-X.Y.Z", place the two files in that folder. Refer to the README file for installation and usage instructions.


Download LibLyric

You can download LibLyric-0.0.2 here.

