When to use SQL*Loader?


Hi,

I load 5,000,000/day  lines of data into a 5,000,000x31/month table partiton.  This is scheduled as a batch job everyday.

The table has only 1 key (phone#), non primary, and there is no fancy constraint stuff. The table contains phone call tranactions and it's partitioned by month.

I tried loading the data using direct=true after dropping the index. It was really fast; however, it took a really long time to re-index.

I tried conventional load with index and it took a long time as well.

The SUN is maxed out on CPU and we are looking for the $ to all more
CPUs.

Questions:

1) Should I go direct load and reindex every night
2) Should I  go conventional and forget re-indexing
3) Should I run multiple conventional load at the same time? I can
break them into 50x100000 runs.
4) Does direct mode know which partition to feed the data?
5) Should I partition the table further into weeks (it's a painful
way)
 
Thanks in advance, any suggestions appreciated.

Ken

The box is  a SUN E450 with 1x250Mhz cpu, 512M, RAID  cache A1000 scsi array. 5x9G RAID5  for the table partition, 2x9G RAID 0 for index. Should I ditch the 250Mhz cpu and get a number of faster CPUs? Oracle is 8.0.4



Ans1:
This is only a suggestion, but have you considered exchanging the
partition out, perform the load, without the index, reindex the non
partitioned table, and then exchange the partition and index back in.

I have not worked on the partitioning recently, and cannot remember if it
is possible to exchange partition and index, but I would have thought so.

This does rely on you loading only one partition at a time.



Ans2:
The key bit is 'it took a really long time to re-index.' Now all you have
to do is find out *why*.

1. Are you creating the index in parallel?
2. Are you killing redo logging?
3. Are you monitoring disk/datafile activity to see where the bottlenecks
are?
4. Are you monitoring the SGA/Oracle buffers to see if they are
contraining the index build?
5. etc., etc., etc.

You need to do a lot more careful analysis, IMHO, as to why this part of
the process is slow. We shunt 100's of millions of rows into our tables,
but we've had to put a lot of work in to getting the performance right. It
seems you need to go through the same process.

Steve Phelan.



Ans3:
A couple of points for clarification:
a)    Is the single index a local index or a global index
b)    5M calls per day  => 150,000,000 per partition
c)    Is each load strictly for a given partition, or could a single load
hit more than one month.
d)    How many months are you keeping
e)    How much data do you have loaded so far
 

Suggestion:
    Make sure the index is a local index

    IF the 5M rows are almost all for the one partition
    (and you can tell which one), then do one direct load
    for that partition, and do the rest of the loading using the
    conventional path.  Do not drop the index before loading
    in either case.
 

    You will need to filter the data set into the 'main load' and
    'the rest' before starting.

    The aim of the strategy is that adding a few rows to each partition
    using a conventional load and updating the index as you go is
    pretty cheap (compared to the cost of even a very efficient rebuild of
   a 150M row index).  But for the large load, adding 5M rows to an index
   is a big hit - which can be reduced by having to sort only the index
   entries for that 5M rows, then merge with the existing 0 - 150M.

   Be aware that the space needed to allow the final index merge  after the
   direct load will be slightly larger than 2 x the maximum index
(partition) size.
 

Answers to your questions:
d)    Yes,  direct path does know about partitions
e)    'Best choice' of partition size is dependent
       on both load and query characteristics, so
       there is no quick answer to this question.

--

Jonathan Lewis
Yet another Oracle-related web site: www.jlcomp.demon.co.uk

Reply:
>A couple of points for clarification:
>a)    Is the single index a local index or a global index

Local already.  I have no guts to try global...

>b)    5M calls per day  => 150,000,000 per partition

Yes.  5M calls per day. If I partition the table by month it willl be
150,000,000 per parition.  That's why I'm thinking about breaking it
down to "week of year" partition: 1999w20, 1999w21 etc.  I have to
rewrite some of the scripts; and that seems to get pretty ugly.  The
queries that I run are all weekly or monthly reports with lots of
"group by" and "sum".

>c)    Is each load strictly for a given partition, or could a single load

In the ideal case, each load should contain the data of 1 day only.
The problem is the end of month.  Example:

19990320.ASC contains all the data of the day.  It goes into table
call.transaction partition 199903.

One of the problem is the possibility that the data file contains data
that span over 2 partitions.

 Example: 19990401.ASC might contain some data in 19990331 or even
19990330. i.e. THERE IS NO WAY TO PREDICT WHICH DATA IS IN THE FILE. I
hope that the sqlldr will figure out the corresponding partition in
direct mode.

>hit more than one month.
>d)    How many months are you keeping

2 months  for now.

>e)    How much data do you have loaded so far
I tried loading 1/2 month using direct while the index is dropped.  It
works.  That's really fast.  Creating the index after the fact took
many many hours (havn't really mearsured it).

If I just defer the index in the direct load and re-build afterwards,
will it be much faster? (i.e. NOT dropping the index)

I got lots of those "index in unusable mode" kind of error last time I
tried.

>
>
>Suggestion:
>    Make sure the index is a local index
>
>    IF the 5M rows are almost all for the one partition
>    (and you can tell which one), then do one direct load
>    for that partition, and do the rest of the loading using the
>    conventional path.  Do not drop the index before loading
>    in either case.
>
>
>    You will need to filter the data set into the 'main load' and
>    'the rest' before starting.

I have written scripts to pre-format the data to certain extend for
Y2K.  I wonder what kind of penality I have if I do too much date
comparison logic in the sqlldr contrl file.   One options is to add a
1 char FLAG to determine the partition.

>
>    The aim of the strategy is that adding a few rows to each partition
>    using a conventional load and updating the index as you go is
>    pretty cheap (compared to the cost of even a very efficient rebuild of
>   a 150M row index).  But for the large load, adding 5M rows to an index
>   is a big hit - which can be reduced by having to sort only the index
>   entries for that 5M rows, then merge with the existing 0 - 150M.
>

Can you give me an example, please?

>   Be aware that the space needed to allow the final index merge  after the
>   direct load will be slightly larger than 2 x the maximum index
>(partition) size.

Buying Disk is not a problem.  I just don't want to MAX out the CPU
slots yet.  I can buy anything as long as I claim it's for Y2K
(yay!!!)

>
>
>Answers to your questions:
>d)    Yes,  direct path does know about partitions

good to hear that.  Do I specifiy any logic in the control file? or
it's smart enough?

>e)    'Best choice' of partition size is dependent
>       on both load and query characteristics, so
>       there is no quick answer to this question.

Assuming that I'll break the partitions into "week of year", do you
see any impact on the queries for the month end?

Example:
select phone, sum (call duration) from call.transaction group by phone
where transaction_date  in march;

The query should run, but I don't know whether it will be a penality
or improvment.

BTW thanks to all who responsed.

KC



Ans4:
You might want to look at a RAID solution or level that doesn't have such
a large write penalty as RAID 5.
 
 
Hosted by www.Geocities.ws

1