SAP Metaphone in ABAP
This ABAP effort is based
on: http://www.wbrogden.com/
by Bill
Brogden
http://www.wbrogden.com/phonetic/index.html
-- has Java code I translated.
I wrote a SAP ABAP Function module ZJNC_PARSE_ARTDESC which
tokenizes any description into max 3 metaphones and max 2 numbers.
It uses a shared memory class to weed out noise
words like "the" "for" ....
It is a SAP NW2004s+ compatible to extent of use of Regular Expressions
feature.
My objective was to search for Retail articles covering Books, Music & Standard Retail stuff.
I used this to create a SAP database table ZMETAPHONE from SAP's material Description table.
It also puts a tag for 3 business areas - Standard Retail, Music, Books.
Then in the final Query screen I displayed a ranked search.
Searching for "Cocktail Bible" "Beatles hits" "Colgate 100gm" "Surf Excel" "Excel Surf" "Basmati Rice 5kg"
got me excellent results with good response times.
SAP has a standard facility for Advanced Text search that is very well integrated in ERP & BI.
TREX is enabled for many Business objects in Master data.
https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/a751a1ec-0a01-0010-f0ba-89e4c5cd0261
shows how easy it is to use. But I wanted something simple and easy to control.
For me the numeric parts are very important in product search.
I had a look at publicly available "Double Metaphone" which I found unsuitable for essentially English English India.
Also having primary & secondary codes is not suitable for creating a Database table.
4.x Code no REGEX
FUNCTION
zjnc_parse_artdesc.
*"----------------------------------------------------------------------
*"*"Local
interface:
*" IMPORTING
*"
REFERENCE(INPTEXT) TYPE CHAR80
*"
EXPORTING
*" REFERENCE(METAPHONE1) TYPE
CHAR4
*" REFERENCE(METAPHONE2) TYPE
CHAR4
*" REFERENCE(METAPHONE3) TYPE
CHAR4
*" REFERENCE(NUMBER1) TYPE
NUM8
*" REFERENCE(NUMBER2) TYPE
NUM8
*"----------------------------------------------------------------------
DATA :
hit(1) TYPE
c,
wword TYPE
char5,
len TYPE
i,
off TYPE i.
TYPES: BEGIN OF
t_token,
token(40) TYPE c,
END OF t_token.
DATA: it_token TYPE STANDARD TABLE OF t_token.
DATA: wa_token TYPE t_token.
FIELD-SYMBOLS: <fs_token> TYPE t_token.
DATA: wnumber TYPE string.
TYPES: BEGIN OF
t_number,
numlen TYPE
i,
number(8) TYPE n,
END OF t_number.
DATA: it_number TYPE STANDARD TABLE OF t_number.
DATA: wa_number TYPE t_number.
FIELD-SYMBOLS: <fs_number> TYPE t_number.
TYPES: BEGIN OF
t_word,
wlen TYPE
i,
word(16) TYPE c,
END OF t_word.
DATA: it_word TYPE STANDARD TABLE OF t_word.
DATA: wa_word TYPE t_word.
FIELD-SYMBOLS: <fs_word> TYPE t_word.
CLEAR: metaphone1, metaphone2, metaphone3, number1, number2.
SPLIT inptext AT
space INTO TABLE it_token.
LOOP AT it_token ASSIGNING <fs_token>.
len = STRLEN( <fs_token>-token ).
IF len =
0.
CONTINUE.
ENDIF.
TRANSLATE <fs_token>-token TO UPPER CASE.
" Remove
other than A-Z 0-9
PERFORM f_clean_token CHANGING
<fs_token>-token.
PERFORM
f_extract_number USING
<fs_token>-token
CHANGING wnumber.
len =
STRLEN( wnumber ).
IF len > 0 AND len <
9.
MOVE wnumber TO
wa_number-number.
MOVE len TO
wa_number-numlen.
APPEND wa_number TO
it_number.
ENDIF.
" Remove
other than A-Z
PERFORM f_retain_alpha CHANGING
<fs_token>-token.
len = STRLEN( <fs_token>-token ).
IF len > 2.
CASE
<fs_token>-token.
WHEN 'ALL'
OR 'AND' OR 'ARE' OR 'BIG' OR 'BOOK' OR
'FOR'.
MOVE 'Y' TO
hit.
WHEN 'FROM' OR 'HOW' OR 'NEW'
OR 'NOT' OR 'NOW' OR
'OUT'.
MOVE 'Y' TO
hit.
WHEN 'SET' OR 'THAT' OR 'THE'
OR 'USE' OR 'WHEN'.
MOVE 'Y' TO hit.
WHEN 'WHO' OR
'WHY' OR 'WILL' OR 'WITH' OR
'YOU'.
MOVE 'Y' TO
hit.
WHEN
OTHERS.
MOVE 'N' TO
hit.
ENDCASE.
IF hit =
'N'.
MOVE <fs_token>-token
TO wa_word-word.
IF len >
16.
MOVE 16 TO
len.
ENDIF.
MOVE len TO
wa_word-wlen.
APPEND wa_word TO
it_word.
ENDIF.
ENDIF.
ENDLOOP.
SORT it_number BY
numlen DESCENDING.
SORT it_word BY wlen
DESCENDING.
LOOP AT it_number
ASSIGNING <fs_number>.
CASE
sy-tabix.
WHEN
1.
MOVE <fs_number>-number
TO number1.
WHEN
2.
MOVE <fs_number>-number
TO number2.
WHEN
OTHERS.
EXIT.
ENDCASE.
ENDLOOP.
LOOP AT it_word
ASSIGNING <fs_word>.
CASE
sy-tabix.
WHEN
1.
PERFORM f_metaphone USING
<fs_word>-word <fs_word>-wlen
metaphone1.
WHEN
2.
PERFORM f_metaphone USING
<fs_word>-word <fs_word>-wlen
metaphone2.
WHEN
3.
PERFORM f_metaphone USING
<fs_word>-word <fs_word>-wlen
metaphone3.
WHEN
OTHERS.
EXIT.
ENDCASE.
ENDLOOP.
ENDFUNCTION.
*&---------------------------------------------------------------------*
*&
Form
f_clean_token
*&---------------------------------------------------------------------*
FORM
f_clean_token CHANGING p_token.
DATA: len TYPE
i,
ptr
TYPE i,
onechar TYPE
c,
o_token TYPE string.
CLEAR: ptr,
o_token.
len = STRLEN( p_token ).
DO len
TIMES.
onechar = p_token+ptr(1).
IF
( onechar <= 'Z'
AND
onechar >= 'A' )
OR ( onechar <=
'9'
AND onechar >= '0'
).
CONCATENATE o_token onechar INTO
o_token.
ENDIF.
ADD 1 TO
ptr.
ENDDO.
p_token = o_token.
ENDFORM. "f_clean_token
*&---------------------------------------------------------------------*
*&
Form
f_retain_alpha
*&---------------------------------------------------------------------*
FORM
f_retain_alpha CHANGING p_token.
DATA: len
TYPE i,
ptr TYPE
i,
onechar TYPE
c,
o_token TYPE string.
CLEAR: ptr,
o_token.
len = STRLEN( p_token ).
DO len
TIMES.
onechar = p_token+ptr(1).
IF
( onechar <= 'Z'
AND
onechar >= 'A' ).
CONCATENATE o_token
onechar INTO o_token.
ENDIF.
ADD 1
TO ptr.
ENDDO.
p_token = o_token.
ENDFORM.
"f_retain_alpha
*&---------------------------------------------------------------------*
*&
Form
f_extract_number
*&---------------------------------------------------------------------*
FORM
f_extract_number USING p_token CHANGING p_number.
DATA:
len TYPE
i,
ptr
TYPE i,
onechar TYPE
c,
o_token TYPE string.
CLEAR: ptr,
o_token.
len = STRLEN( p_token ).
DO len
TIMES.
onechar = p_token+ptr(1).
IF
( onechar <= '9'
AND
onechar >= '0' ).
CONCATENATE o_token
onechar INTO o_token.
ENDIF.
ADD 1
TO ptr.
ENDDO.
p_number = o_token.
ENDFORM. "f_extract_number
*&--------------------------------------------------------------------*
*&
Form
f_metaphone
*&--------------------------------------------------------------------*
FORM
f_metaphone USING inpword TYPE
c
inpwlen TYPE
i
CHANGING metaphone TYPE c.
DATA:
inpoff TYPE
i,
inofp1
TYPE i,
inofp2 TYPE
i,
inofm1
TYPE i,
wlocal(40) TYPE
c,
wmetphn(8) TYPE
c,
wrdsiz
TYPE i,
outoff TYPE
i,
hard(1) TYPE
c,
inpwlm1 TYPE
i,
inpchr(1) TYPE
c.
CONSTANTS:
maxcodelen TYPE i VALUE 4.
CLEAR: inpoff, outoff, wmetphn.
outoff =
0.
hard = 'N'.
inpwlm1 = inpwlen - 1.
" handle initial 2
characters exceptions
CASE inpword+0(1).
WHEN
'K' OR 'G' OR 'P'. " looking for KN, etc
IF inpword+1(1) = 'N'.
wlocal =
inpword+1(inpwlm1).
ELSE.
wlocal =
inpword.
ENDIF.
WHEN
'A'. " looking for AE
IF inpword+1(1) =
'E'.
wlocal =
inpword+1(inpwlm1).
ELSE.
wlocal =
inpword.
ENDIF.
WHEN
'W'. " looking for WR or WH
IF inpword+1(1) =
'R'. " WR -> R
wlocal =
inpword+1(inpwlm1).
ELSE.
IF inpword+1(1) =
'H'.
wlocal =
inpword+1(inpwlm1).
wlocal+0(1) = 'W'. " WH -> W
ELSE.
wlocal =
inpword.
ENDIF.
ENDIF.
WHEN
'X'. " initial X becomes S
wlocal =
inpword.
wlocal+0(1) = 'S'.
WHEN
OTHERS.
wlocal = inpword.
ENDCASE. "
now wlocal has working string with initials fixed
wrdsiz = STRLEN(
wlocal ).
inpoff = 0.
DO.
IF outoff >= maxcodelen. " max metaphone size of 4
works well
EXIT.
ENDIF.
IF
inpoff >= wrdsiz.
EXIT.
ENDIF.
inofm1 =
inpoff - 1.
inofp1 = inpoff + 1.
inofp2 = inpoff + 2.
inpchr = wlocal+inpoff(1).
" remove
duplicate letters except C
IF ( inpchr <> 'C') AND (
inpoff > 0 ) AND ( wlocal+inofm1(1) = inpchr
).
inpoff = inpoff +
1.
CONTINUE.
ENDIF.
CASE
inpchr.
WHEN 'A' OR 'E' OR 'I' OR 'O' OR
'U'.
IF inpoff =
0.
wmetphn+outoff(1) =
inpchr.
outoff =
outoff + 1.
ENDIF. " only use
vowel if leading char
WHEN
'B'.
IF ( inpoff > 0 AND
wlocal+inofm1(1) =
'M'
AND (
inofp1 =
wrdsiz
OR ( inofp2 = wrdsiz AND wlocal+inofp1(1) = 'E' ) ) ).
" MB or MBE
at end of word
inpoff
= inpoff + 1.
CONTINUE.
ELSE.
wmetphn+outoff(1) =
inpchr.
outoff =
outoff + 1.
ENDIF.
WHEN 'C'. " lots of C special cases
" discard if SCI, SCE or
SCY
IF ( inpoff > 0 ) AND (
wlocal+inofm1(1) = 'S' ) AND ( inofp1 < wrdsiz
).
CASE
wlocal+inofp1(1).
WHEN 'E' OR 'I' OR
'Y'.
inpoff = inpoff +
1.
CONTINUE.
ENDCASE.
ENDIF.
IF wlocal+inpoff(3) =
'CIA'. " CIA -> X
wmetphn+outoff(1) =
'X'.
outoff = outoff +
1.
inpoff = inpoff +
2.
CONTINUE.
ENDIF.
IF inofp1 <
wrdsiz.
CASE
wlocal+inofp1(1).
WHEN 'E' OR 'I' OR
'Y'.
wmetphn+outoff(1) =
'S'.
outoff = outoff + 1. " CI,CE,CY ->
S
inpoff = inpoff +
1.
CONTINUE.
ENDCASE.
ENDIF.
IF ( inpoff > 0 ) AND
( wlocal+inofm1(3) = 'SCH' ). "
SCH->sk
wmetphn+outoff(1) =
'K'.
outoff = outoff +
1.
inpoff = inpoff +
1.
CONTINUE.
ENDIF.
IF wlocal+inpoff(2) =
'CH'. " detect CH
IF (
inpoff = 0 ) AND ( wrdsiz >= 3 ). " CH consonant -> K
consonant
CASE
wlocal+2(1).
WHEN 'A' OR 'E' OR 'I' OR 'O' OR
'U'.
wmetphn+outoff(1) = 'X'. " CHvowel ->
X
outoff = outoff +
1.
inpoff = inpoff +
1.
CONTINUE.
ENDCASE.
ELSE.
wmetphn+outoff(1) =
'K'.
outoff = outoff +
1.
inpoff
= inpoff +
1.
CONTINUE.
ENDIF.
ENDIF.
wmetphn+outoff(1) =
'K'.
outoff = outoff +
1.
WHEN
'D'.
IF ( inofp2 < wrdsiz ) AND
( wlocal+inofp1(1) = 'G' ). " DGE DGI DGY ->
J
CASE
wlocal+inofp2(1).
WHEN 'E' OR 'I' OR
'Y'.
wmetphn+outoff(1) =
'J'.
inpoff = inpoff + 2.
ENDCASE.
ELSE.
wmetphn+outoff(1) = 'T'.
ENDIF.
outoff = outoff + 1.
WHEN 'G'. " GH silent at end or
before consonant
IF ( inofp2 =
wrdsiz ) AND ( wlocal+inofp1(1) = 'H'
).
inpoff = inpoff +
1.
CONTINUE.
ENDIF.
IF ( inofp2 < wrdsiz
) AND ( wlocal+inofp1(1) = 'H'
).
CASE
wlocal+inofp2(1).
WHEN 'A' OR 'E' OR 'I' OR 'O' OR
'U'.
inpoff = inpoff + 0. " do
Nothing!
WHEN
OTHERS.
inpoff = inpoff +
1.
CONTINUE.
ENDCASE.
ENDIF.
IF ( inpoff > 0 ) AND
( wlocal+inpoff(2) =
'GN'
OR wlocal+inpoff(4) = 'GNED'
).
inpoff = inpoff +
1.
CONTINUE.
ENDIF. " silent
G
IF ( inpoff > 0 ) AND
( wlocal+inofm1(1) = 'G'
).
hard =
'Y'.
ELSE.
hard =
'N'.
ENDIF.
IF ( inofp1 < wrdsiz
) AND ( hard = 'N' ).
CASE
wlocal+inofp1(1).
WHEN 'E' OR 'I' OR
'Y'.
wmetphn+outoff(1) =
'J'.
WHEN
OTHERS.
wmetphn+outoff(1) =
'G'.
ENDCASE.
ELSE.
wmetphn+outoff(1) = 'K'.
ENDIF.
outoff = outoff + 1.
WHEN
'H'.
IF inofp1 =
wrdsiz.
inpoff =
inpoff + 1.
CONTINUE.
ENDIF. " terminal
H
IF ( inpoff > 0
).
CASE
wlocal+inofm1(1).
WHEN 'C' OR 'S' OR 'P' OR 'T' OR
'G'.
inpoff = inpoff +
1.
CONTINUE.
ENDCASE.
ENDIF.
CASE
wlocal+inofp1(1).
WHEN
'A' OR 'E' OR 'I' OR 'O' OR
'U'.
wmetphn+outoff(1) =
'H'.
outoff = outoff + 1. " Hvowel
ENDCASE.
WHEN 'F' OR 'J' OR 'L' OR 'M'
OR 'N' OR 'R'.
wmetphn+outoff(1) =
inpchr.
outoff = outoff +
1.
WHEN
'K'.
IF inpoff > 0. " not
initial
IF
wlocal+inofm1(1) <>
'C'.
wmetphn+outoff(1) =
inpchr.
outoff = outoff + 1.
ENDIF.
ELSE.
wmetphn+outoff(1) = inpchr. " initial
K
outoff = outoff +
1.
ENDIF.
WHEN
'P'.
IF ( inofp1 < wrdsiz ) AND
( wlocal+inofp1(1) = 'H' ). " PH ->
F
wmetphn+outoff(1) =
'F'.
ELSE.
wmetphn+outoff(1) = inpchr.
ENDIF.
outoff = outoff + 1.
WHEN
'Q'.
wmetphn+outoff(1) =
'K'.
outoff = outoff +
1.
WHEN
'S'.
IF ( wlocal+inpoff(2) = 'SH'
)
OR ( wlocal+inpoff(3) = 'SIO'
)
OR ( wlocal+inpoff(3) = 'SIA'
).
wmetphn+outoff(1) =
'X'.
ELSE.
wmetphn+outoff(1) = 'S'.
ENDIF.
outoff = outoff + 1.
WHEN
'T'.
IF ( wlocal+inpoff(3) = 'TIA'
)
OR ( wlocal+inpoff(3) = 'TIO'
).
wmetphn+outoff(1) =
'X'.
outoff = outoff +
1.
inpoff = inpoff +
1.
CONTINUE.
ENDIF.
IF wlocal+inpoff(3) =
'TCH'.
inpoff = inpoff
+ 1.
CONTINUE.
ENDIF.
" substitute numeral 0
for TH (resembles theta after all)
IF wlocal+inpoff(2) =
'TH'.
wmetphn+outoff(1) = '0'.
ELSE.
wmetphn+outoff(1) = 'T'.
ENDIF.
outoff = outoff + 1.
WHEN
'V'.
wmetphn+outoff(1) =
'F'.
outoff = outoff +
1.
WHEN 'W' OR 'Y'. " silent if not
followed by vowel
IF inofp1 <
wrdsiz.
CASE
wlocal+inofp1(1).
WHEN 'A' OR 'E' OR 'I' OR 'O' OR
'U'.
wmetphn+outoff(1) =
inpchr.
outoff = outoff + 1.
ENDCASE.
ENDIF.
WHEN
'X'.
wmetphn+outoff(1) =
'K'.
outoff = outoff +
1.
wmetphn+outoff(1) =
'S'.
outoff = outoff +
1.
WHEN
'Z'.
wmetphn+outoff(1) =
'S'.
outoff = outoff +
1.
ENDCASE.
inpoff = inpoff + 1.
ENDDO.
metaphone = wmetphn+0(4).
ENDFORM. " end f_metaphone
* Ref: http://aspell.net/metaphone/
*
Program Based on: http://www.wbrogden.com/
*
http://www.wbrogden.com/phonetic/index.html
*
* Objective: get back a list of words that might have a
similar pronunciation.
* This list might be useful if you are looking for
alternate spellings.
*
* The original metaphone algorithm was published by
Lawrence Philips
* in an article entitled "Hanging on the Metaphone" in the
journal
* Computer Language v7 n12, December 1990, pp39-43.
* His
algorithm - translated into Java, and with minor tweaks
* - is what we are
using here.
* Naturally, a phonetic encoding system has to assume a
particular language and culture.
* Here we are using essentially American
English.
*
* The Metaphone Rules
* Metaphone reduces the alphabet to 16
consonant sounds:
*
* B X S K J T F H L M N P R 0 W Y
*
* That isn't
an O but a zero - representing the 'th' sound.
*
* Transformations
*
Metaphone uses the following transformation rules:
* Doubled letters except
"c" -> drop 2nd letter.
* Vowels are only kept when they are the first
letter.
*
* B -> B unless at the end of a word after "m" as
in "dumb"
* C -> X (sh) if -cia- or
-ch-
* S if -ci-, -ce- or
-cy-
* K otherwise, including
-sch-
* D -> J if in -dge-, -dgy- or
-dgi-
* T otherwise
* F ->
F
* G -> silent if in -gh- and not at end or
before a vowel
* in
-gn- or -gned- (also see dge etc. above)
*
J if before i or e or y if not double
gg
* K otherwise
* H
-> silent if after vowel and no vowel
follows
* H otherwise
* J ->
J
* K -> silent if after
"c"
* K otherwise
* L ->
L
* M -> M
* N -> N
* P -> F if before
"h"
* P otherwise
* Q ->
K
* R -> R
* S -> X (sh) if before "h" or in -sio- or
-sia-
* S otherwise
* T ->
X (sh) if -tia- or -tio-
*
0 (th) if before
"h"
* silent if in
-tch-
* T otherwise
* V ->
F
* W -> silent if not followed by a
vowel
* W if followed by a
vowel
* X -> KS
* Y -> silent if not
followed by a vowel
* Y if followed
by a vowel
* Z -> S
*
* Initial Letter Exceptions
*
*
Initial kn-, gn- pn, ac- or wr- -> drop
first letter
* Initial
x-
-> change to "s"
* Initial
wh-
-> change to "w"
*
* The code is truncated at 4 characters in this
example, but more could be used.