LibLyric is a tool that allows you to download song lyrics from the web
without relying on any one web site, or the HTML page structure
thereof. This tool will access the web to download the lyrics, and give
you what it thinks is the best match.
First, a request is made to get the list of possible pages (sites)
which contain the requested song's lyrics.
After this, all the links are extracted from the HTML page
returned, and are checked for site duplicates. If several pages
come from the same site, one of them is kept and the rest are
discarded.
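
The link extraction and site de-duplication described above can be
sketched as follows. This is only an illustration, not LibLyric's own
code: the names LinkCollector and one_link_per_site are hypothetical,
"site" is assumed to mean the host part of the URL, and the first link
seen for each host is the one kept (the text only says that any one of
them is kept).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def one_link_per_site(html):
    """Return the absolute http(s) links in html, keeping one per host.

    Assumption: two links belong to the same site when their host names
    match case-insensitively, and the first one seen is kept.
    """
    collector = LinkCollector()
    collector.feed(html)
    seen_hosts = set()
    kept = []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # skip relative links, mailto:, javascript:, etc.
        host = parts.netloc.lower()
        if host not in seen_hosts:
            seen_hosts.add(host)
            kept.append(link)
    return kept
```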
Now, all the pages returned from the above operation are
downloaded, and stored in a temporary directory
/tmp/liblyric/p.PID, where PID is the Process ID of
the running instance of LibLyric.
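
As a small illustration of the directory naming above, the following
sketch builds the per-process path and creates it. The function names
page_dir and make_page_dir are hypothetical, and the base directory is
a parameter only so that the sketch can be tried without touching
/tmp/liblyric.

```python
import os


def page_dir(base="/tmp/liblyric", pid=None):
    """Build the per-process directory path, e.g. /tmp/liblyric/p.1234."""
    if pid is None:
        pid = os.getpid()  # Process ID of the running instance
    return f"{base}/p.{pid}"


def make_page_dir(base="/tmp/liblyric"):
    """Create the per-process directory if needed and return its path."""
    path = page_dir(base)
    os.makedirs(path, exist_ok=True)
    return path
```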
These downloaded pages are now HTML tag-stripped. Also, all
scripts and comments are removed. All malformed <br> tags such
as <BR>, <BR/ >, and so on are replaced by a
single <br> tag, and all <br> tags are then replaced by
newlines. The operation for each page happens in parallel, so the
overall wait time is reduced.
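
A minimal sketch of this clean-up step, using regular expressions, is
given below. It is an assumption about how such a filter could look,
not LibLyric's actual implementation: the function names, the exact
patterns, the order of the steps, and the use of a thread pool for the
parallel part are all illustrative choices.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Illustrative patterns; real-world HTML may need a more forgiving parser.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
BR_RE = re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean_page(html):
    """Strip scripts, comments and tags, turning <br> variants into newlines."""
    text = SCRIPT_RE.sub("", html)
    text = COMMENT_RE.sub("", text)
    text = BR_RE.sub("<br>", text)      # normalise <BR>, <BR/ >, <br /> ...
    text = text.replace("<br>", "\n")   # every <br> becomes a newline
    return TAG_RE.sub("", text)         # drop all remaining tags


def clean_pages(pages):
    """Clean several pages concurrently; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(clean_page, pages))
```

For example, clean_page("a<BR/ >b") gives "a" and "b" on separate
lines.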
After this, we perform an all-to-all, 2-way approximate
intersection of these downloaded tag-stripped HTML pages. You can
look at the pages as nodes (vertices) in a fully connected
asymmetric graph, and the weight on each edge as the size of the
intersection between the two pages it connects.
The above operation produces an intermediate file called
extents.txt. This file contains many rows, and each row
stands for a single entry. The format of each row is as follows:
Extent Size
Extent Start
Extent End
File Name
where File Name is the name of the file to which the extent
belongs. An extent is a block of text in any two pages which
matches approximately. The intersection of any two pages returns the
largest extent found, or nothing if no extent exceeds the internal
threshold limit. This is done to prevent small rogue extents from
popping up.
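
To make the idea of an extent concrete, here is a rough sketch of the
pairwise intersection. Note the substitution: it uses Python's difflib
to find the longest exactly matching block, whereas LibLyric matches
approximately by a method not described here. The threshold value of
20, the function names, and the choice of treating Extent End as
exclusive are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import permutations

MIN_EXTENT = 20  # hypothetical threshold; the real internal limit is not stated


def largest_extent(a, b, min_size=MIN_EXTENT):
    """Return (size, start, end) of the longest block of a also found in b.

    Offsets refer to a, and end is exclusive. Returns None when the block
    is smaller than min_size. Exact matching stands in for the approximate
    matching that LibLyric performs.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match.size < min_size:
        return None
    return (match.size, match.a, match.a + match.size)


def all_extents(pages, min_size=MIN_EXTENT):
    """pages maps file name to text. Returns rows (size, start, end, name).

    Every ordered pair of pages is compared, so each page gets one row for
    each other page it shares a large enough extent with.
    """
    rows = []
    for name_a, name_b in permutations(pages, 2):
        extent = largest_extent(pages[name_a], pages[name_b], min_size)
        if extent is not None:
            rows.append(extent + (name_a,))
    return rows
```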
Next, we sort the entries in this intermediate file in
descending order by the Extent Size, and remove all entries
where the Extent Start is less than 32. This operation
produces another file called ordered_exts.txt.
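
The sort-and-filter step can be sketched as below. The on-disk layout
of extents.txt is not specified here, so the sketch works on rows that
have already been read into tuples of (Extent Size, Extent Start,
Extent End, File Name); the function name order_extents is
hypothetical.

```python
MIN_START = 32  # entries starting before this offset are removed


def order_extents(rows, min_start=MIN_START):
    """Drop rows whose Extent Start is below min_start, then sort the rest
    by Extent Size, largest first."""
    kept = [row for row in rows if row[1] >= min_start]
    return sorted(kept, key=lambda row: row[0], reverse=True)
```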
Now, if there are at least 2 entries in ordered_exts.txt,
the second one is extracted, and Extent Size bytes of text
starting from offset Extent Start in the file are displayed
after passing them through some other filters. If there is just one
entry, then that is displayed; otherwise an error saying that no
lyrics were found is displayed. The choice of using the second
entry in this file is a purely empirical one.
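
The selection rule above can be sketched as follows. The name
pick_lyrics and the read_file callback (which is assumed to return the
contents of a file as bytes) are hypothetical, and the additional
output filters mentioned above are left out.

```python
def pick_lyrics(ordered_rows, read_file):
    """Choose an extent from the ordered rows and return its bytes.

    ordered_rows holds (size, start, end, file name) tuples, largest first.
    The second row is preferred, the first is used when it is the only one,
    and LookupError is raised when there are no rows at all.
    """
    if not ordered_rows:
        raise LookupError("no lyrics were found")
    row = ordered_rows[1] if len(ordered_rows) >= 2 else ordered_rows[0]
    size, start, _end, name = row
    data = read_file(name)
    return data[start:start + size]  # Extent Size bytes from Extent Start
```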
You can download the files on the above two pages, and place them in
the folder created after you have downloaded and extracted
liblyric. So, if you get a folder named "liblyric-X.Y.Z", you should
place the above two files in that folder. Refer to the README file for
instructions on installation and usage.