I load 5,000,000/day lines of data into a 5,000,000x31/month table partiton. This is scheduled as a batch job everyday.
The table has only 1 key (phone#), non primary, and there is no fancy constraint stuff. The table contains phone call tranactions and it's partitioned by month.
I tried loading the data using direct=true after dropping the index. It was really fast; however, it took a really long time to re-index.
I tried conventional load with index and it took a long time as well.
The SUN is maxed out on CPU and we are looking for the $ to all more
CPUs.
Questions:
1) Should I go direct load and reindex every night
2) Should I go conventional and forget re-indexing
3) Should I run multiple conventional load at the same time? I can
break them into 50x100000 runs.
4) Does direct mode know which partition to feed the data?
5) Should I partition the table further into weeks (it's a painful
way)
Thanks in advance, any suggestions appreciated.
Ken
The box is a SUN E450 with 1x250Mhz cpu, 512M, RAID cache
A1000 scsi array. 5x9G RAID5 for the table partition, 2x9G RAID 0
for index. Should I ditch the 250Mhz cpu and get a number of faster CPUs?
Oracle is 8.0.4
I have not worked on the partitioning recently, and cannot remember
if it
is possible to exchange partition and index, but I would have thought
so.
This does rely on you loading only one partition at a time.
1. Are you creating the index in parallel?
2. Are you killing redo logging?
3. Are you monitoring disk/datafile activity to see where the bottlenecks
are?
4. Are you monitoring the SGA/Oracle buffers to see if they are
contraining the index build?
5. etc., etc., etc.
You need to do a lot more careful analysis, IMHO, as to why this part
of
the process is slow. We shunt 100's of millions of rows into our tables,
but we've had to put a lot of work in to getting the performance right.
It
seems you need to go through the same process.
Steve Phelan.
Suggestion:
Make sure the index is a local index
IF the 5M rows are almost all for the one partition
(and you can tell which one), then do one direct
load
for that partition, and do the rest of the loading
using the
conventional path. Do not drop the index before
loading
in either case.
You will need to filter the data set into the 'main
load' and
'the rest' before starting.
The aim of the strategy is that adding a few rows
to each partition
using a conventional load and updating the index
as you go is
pretty cheap (compared to the cost of even a very
efficient rebuild of
a 150M row index). But for the large load, adding
5M rows to an index
is a big hit - which can be reduced by having to sort
only the index
entries for that 5M rows, then merge with the existing
0 - 150M.
Be aware that the space needed to allow the final index
merge after the
direct load will be slightly larger than 2 x the maximum
index
(partition) size.
Answers to your questions:
d) Yes, direct path does know about partitions
e) 'Best choice' of partition size is dependent
on both load and query characteristics,
so
there is no quick answer to this
question.
--
Jonathan Lewis
Yet another Oracle-related web site: www.jlcomp.demon.co.uk
Reply:
>A couple of points for clarification:
>a) Is the single index a local index or a global
index
Local already. I have no guts to try global...
>b) 5M calls per day => 150,000,000 per partition
Yes. 5M calls per day. If I partition the table by month it willl
be
150,000,000 per parition. That's why I'm thinking about breaking
it
down to "week of year" partition: 1999w20, 1999w21 etc. I have
to
rewrite some of the scripts; and that seems to get pretty ugly.
The
queries that I run are all weekly or monthly reports with lots of
"group by" and "sum".
>c) Is each load strictly for a given partition, or could a single load
In the ideal case, each load should contain the data of 1 day only.
The problem is the end of month. Example:
19990320.ASC contains all the data of the day. It goes into table
call.transaction partition 199903.
One of the problem is the possibility that the data file contains data
that span over 2 partitions.
Example: 19990401.ASC might contain some data in 19990331 or even
19990330. i.e. THERE IS NO WAY TO PREDICT WHICH DATA IS IN THE FILE.
I
hope that the sqlldr will figure out the corresponding partition in
direct mode.
>hit more than one month.
>d) How many months are you keeping
2 months for now.
>e) How much data do you have loaded so far
I tried loading 1/2 month using direct while the index is dropped.
It
works. That's really fast. Creating the index after the
fact took
many many hours (havn't really mearsured it).
If I just defer the index in the direct load and re-build afterwards,
will it be much faster? (i.e. NOT dropping the index)
I got lots of those "index in unusable mode" kind of error last time
I
tried.
>
>
>Suggestion:
> Make sure the index is a local index
>
> IF the 5M rows are almost all for the one partition
> (and you can tell which one), then do one direct
load
> for that partition, and do the rest of the loading
using the
> conventional path. Do not drop the index
before loading
> in either case.
>
>
> You will need to filter the data set into the 'main
load' and
> 'the rest' before starting.
I have written scripts to pre-format the data to certain extend for
Y2K. I wonder what kind of penality I have if I do too much date
comparison logic in the sqlldr contrl file. One options
is to add a
1 char FLAG to determine the partition.
>
> The aim of the strategy is that adding a few rows
to each partition
> using a conventional load and updating the index
as you go is
> pretty cheap (compared to the cost of even a very
efficient rebuild of
> a 150M row index). But for the large load, adding
5M rows to an index
> is a big hit - which can be reduced by having to sort
only the index
> entries for that 5M rows, then merge with the existing
0 - 150M.
>
Can you give me an example, please?
> Be aware that the space needed to allow the final index
merge after the
> direct load will be slightly larger than 2 x the maximum
index
>(partition) size.
Buying Disk is not a problem. I just don't want to MAX out the
CPU
slots yet. I can buy anything as long as I claim it's for Y2K
(yay!!!)
>
>
>Answers to your questions:
>d) Yes, direct path does know about partitions
good to hear that. Do I specifiy any logic in the control file?
or
it's smart enough?
>e) 'Best choice' of partition size is dependent
> on both load and query characteristics,
so
> there is no quick answer to this
question.
Assuming that I'll break the partitions into "week of year", do you
see any impact on the queries for the month end?
Example:
select phone, sum (call duration) from call.transaction group by phone
where transaction_date in march;
The query should run, but I don't know whether it will be a penality
or improvment.
BTW thanks to all who responsed.
KC