Tuesday, May 08, 2007

importing excellence in my sequel

i talked to a good friend of mine over the weekend. he wrote a program way back in the day to handle transforming data for databases; in particular, it can convert excel files into sql statements.

lost already? well, this post is mainly for programmers and the curious.

while his script was wonderful and really graphical, i found that it was limiting. limiting because, and he knows about this too, the size of the file you can convert is capped. it gobbles up resources quickly, so trying to process a 10mb file results in a 'fatal error'.

it has actually been a long time since i last used his script. why? well, i found things that make my life easier.

the easiest way to import an excel file into a mysql database is to use phpmyadmin. all you have to do is save the excel file as a CSV file, or a Comma Separated Values file. from there, log in to phpmyadmin, go to the table you want to import the data into, click import, and there it is!
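just to illustrate, here's what a tiny CSV might look like, assuming a hypothetical zipcodes table with zip, city, and state columns (made up for this example):

    zip,city,state
    10001,new york,NY
    90210,beverly hills,CA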

if you don't have phpmyadmin, or if the file is so big the browser would time out uploading it, i still wouldn't recommend the command-line tool built into mysql (the mysqlimport thingy). too many things can go wrong. what you can do is download this little app i found on the internet which took care of that. it's called the csv to sql converter. neat, huh? you can download it from here.
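to give you an idea of what the app does under the hood, here's a rough sketch in python. this is not the app's actual code, just the general technique, and the file, table, and column names are made up:

    import csv

    def csv_to_sql(csv_path, sql_path, table, fields):
        # read the CSV one row at a time so even a 35mb file stays easy on memory
        with open(csv_path, newline="") as src, open(sql_path, "w") as dst:
            reader = csv.reader(src)
            next(reader)  # skip the header row of column names
            for row in reader:
                # quote and escape each value; a real tool also handles NULLs, numbers, etc.
                values = ", ".join("'" + v.replace("'", "''") + "'" for v in row)
                dst.write("INSERT INTO %s (%s) VALUES (%s);\n"
                          % (table, ", ".join(fields), values))

    csv_to_sql("zipcodes.csv", "output.sql", "zipcodes", ["zip", "city", "state"])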

i had to import a 35mb csv file into a table recently. it's the zipcode database for the entire US. it's huge. phpmyadmin didn't allow me to import it because it was too big; there's a 2mb upload limit in phpmyadmin. this app came to the rescue.

here are the steps:
1.) in excel, save the data as CSV. d'oh.
2.) run csv2sqlconverter.exe. it doesn't install anything on your computer, so don't worry.
3.) in the input file field, browse for the CSV file.
4.) create a dummy sql file to store the result in, and browse for that in the output file field.
5.) fill in the table name.
6.) fill in the field names you want your data inserted into (the column names/headings in excel). this is especially useful if you want to use mysql's auto_increment feature for the primary key.
7.) select whether or not to split the result into smaller files. especially useful for big files, so you can upload them safely without worry.
8.) press convert!
9.) locate the files and upload or transfer them to the remote directory or working directory where you can access mysql via the command line.
10.) open your command-line client.
11.) type in: mysql db_name < output_file.sql (that assumes no user is required; if one is, adjust the syntax accordingly: mysql -u user -p db_name < output_file.sql).
12.) repeat step 11 for each output file (or loop it; see the sketch after this list).
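if the converter split the result into a bunch of files, a quick loop saves you from typing step 11 over and over. here's a rough python sketch; the file pattern, user, password, and database name are all made up, so adjust to taste:

    import glob
    import subprocess

    # runs: mysql -u user -p... db_name < file, once per split output file
    for path in sorted(glob.glob("output_*.sql")):
        with open(path, "rb") as f:
            subprocess.run(["mysql", "-u", "user", "-psecret", "db_name"],
                           stdin=f, check=True)  # stop if any file fails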

there ya go! looks like a lot of steps, but trust me, it shouldn't take more than two minutes (not counting upload time).

since we're on the topic of mysql, let me share with you a recent... encounter. i had a script which needed to pull data from 3 related tables (each having about 5k rows). naturally, i used joins. when i deployed it in the real world, it didn't even last 5 minutes. it clogged up the database, and brought down the site. the reason? joins the database can't optimize well are highly inefficient. in this particular case, since there were additional criteria on each table being queried, my query was pushed toward its worst case: n*n*n row comparisons.

the solution? there is no clear-cut solution for every situation. one very good and useful piece of advice i can offer, like in this case, is to break the query down into smaller single-table queries. the aim is to distribute the load between the database and the script. you can do that with nested loops, each loop level translating to a table in the database, with the innermost loop handling the table at the end of the dependency chain (in a situation where a.id = b.a_id and b.id = c.b_id and a.field_1 = something and b.field_1 = something and c.field_1 = something order by c.field_2, this is absolutely useful).
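here's a rough python sketch of that idea. my script wasn't in python, and the table and column names just follow the hypothetical a/b/c example above; cur is a DB-API cursor (e.g. from pymysql):

    def first_matches(cur, a_val, b_val, c_val, limit=10):
        # nested loops, one level per table: a (outer), b (middle), c (inner)
        results = []
        cur.execute("SELECT id FROM a WHERE field_1 = %s", (a_val,))
        for (a_id,) in cur.fetchall():
            cur.execute("SELECT id FROM b WHERE a_id = %s AND field_1 = %s",
                        (a_id, b_val))
            for (b_id,) in cur.fetchall():
                cur.execute("SELECT * FROM c WHERE b_id = %s AND field_1 = %s "
                            "ORDER BY field_2", (b_id, c_val))
                for row in cur.fetchall():
                    results.append(row)
                    if len(results) == limit:
                        return results  # the early break: stop once we have enough
        return results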

what's the difference? the difference is that more likely than not, you will not need the entire result set. you'll probably just need the first 10 or 20 rows. with one big query, the database processes all the rows and comes up with the entire result set before truncating it to the first 10 or 20 you need. whereas with nested loops, you can issue a break once you get the first 10 (that's the early return in the sketch above)! that, my friends, is the big difference, especially in the real world of shared hosting environments.

well, i hope tonight's post was of some help!
