Weston Rowley's Data Mining Log

Wednesday, August 26, 2009

Long Time No Blog

So, I've officially been a slacker when it comes to posting.

Since I last posted, things have been fast but interesting. I started working on a contest sponsored by UCSD, which we did alright in. We weren't spectacular, but considering that so many of us were rookies, I'd say we did pretty well.

Toward the end of the contest, I started helping finish a project with PolySci department. We were analyzing campaign contribution records to preform record linkage on donors, and it was my responsibility to find a metric to describe how well we did. This was accomplished by comparing manual clusters with computer generated clusters, and evaluating how similar the two were. All in all, it was very successful, and I'll hopefully be getting my name on a publication.

Tuesday, June 9, 2009

This last week has been frustrating. I spent almost the entire last week trying to figure out how to upload three tables from one part of the server into MySQL's databases on the server. So far, it's been one gigantic trial. With the help of Reed, I've managed to get two of the tables on. Unfortunatly, the third table is proving to be very difficult. It refuses to upload correctly the way either of the other two databases uploaded. I've tried going into MySQL and using the INSERT DATA INFILE command, I've tried staying in linux and using the databasename < filepath command, I've tried using the mysqlimport command, all to no avail. I've gotten a little more then fed up with this. I guess it's been a learning experience, I've had to learn alot more about how a database is put together. I've also gotten to learn some more linux commands.

Update: Just finished uploading the tables. I think I actually only need to have two of the tables - the two orders tables actually contain identical columns, so I'll just mine one of them for now. I've started extracting useful columns from the tables.

I'm having trouble coming up with a good business objective to mine for. One thing that could be interesting is to try to identify what might prompt a user to make a purchase through a particular channel. Another interesting thing could be to figure out patterns in where people tend to shop in relation to where they live.

Wednesday, May 27, 2009

It's 20 May now, which means that it's been a week since I last posted. That's pretty negligent on my part, so I figure it's time to post again.

Actually, the truth of the matter is that I just got help in figuring out how to do a particular PostgreSQL command. That was probably about 10-15 minutes ago. It's still running, so I decided to take some time to update my blog. I really hope it doesn't crash; I really have no way to tell if its still running. In all reality, it could have crashed long ago and I wouldn't know. On the other hand, it could be running fine, and if I cancel it to retry, it could ruin all the progress that it has so far. I wish I knew why my server querries have been taking so long; the time that they take seems to get longer every day. I just lost and reestablished connection to the server, so I should probably cancel and retry the command.

I'm still trying to build a table with information for all the individual customers - cutting out wholesalers, coorperations, government entities, churches, exc. The first problem I've run into is that customers are often mislabeled. I have to look not only at their class, but also at the profile class. Some of these seem to be significant, and others seem to just be randomly assigned. I've done some filtering that I think has identified viable customers, but I really can't be completely sure without sifting through the 300000 odd customers. I don't plan on doing that.

Now that I've identified customers, I'm trying to put together information for them. In particular, I'm trying to have the computer count the number of purchases they've made through various channels, then sum the values of those purchases. That wouldn't be too difficult, except that it takes so long to run each one of the queries that I pretty much have to get it right the first time I do it. The computer is still processing the first of those querries - counting the retail purchases each customer has made. I hope it's done soon so I can move on to another part of the table.

Wednesday, May 20, 2009

Postgres holds one less secret from me

Today was a slow but productive day. I spent a good portion of it figuring out how to export a table from a postgres database, and I think I was altogether successful. I pulled a few columns off the orders table, which I hope will help me in creating a random sample size of customers. The difficulty is, I need to find totals on what each customer ordered, so I needed all the rows in the orders table. I selected these rows on the condition that they're tied to a customer who is labeled as a Consumer, so that likely cut out alot of unnecicery rows. The problem is, I still have millions of rows, so I'm having trouble finding a progrem that can handle all that data. I could probably write a script that would handle it for me, but I don't know enough progreming to be able to do that. Instead, I'm going to try to clean the data with microsoft access, using their embedded VBA to look at all the data. Should be an adventure, I think.

I just got Access to open the orders and customer sample databases. Next on my list of things to do is figure out how VBA in access differs from VBA in Excell. I'll be pretty happy when I get most of the data on the customer sheet, because then I can open it in excel and just work from there. I'm sure I'll do fine with Access, though.

Friday, May 15, 2009

Figuring out what I'm after

I've reviewed the fields on the numerous tables in the database I'm studying, and I'm pretty sure I know what most of them mean. I guess the next step in the process is to figure out what I'm trying to find out in the data, so I can construct additional data and clean existing data accordingly. I think that it might be useful to create a method for identifying valuable customers. I might do this by creating an ordered hierarchy of customers based on how much they spent in a given year (or the average over several years) and assigning each customer a percentile. Then, by looking at the total amount spent by each percentile (likely banded to 2 - 5 percentiles per band) I could identify what percent of the profits come from what percent of the customers. (For instance, everyone at or above the 75th percentile contributed more then 60% of the total profits.)

I'm still not sure how to attack the issue of customer's spending channels - typifying what kind of customers spend how much through which avenues. I think that identifying "valuable" customers may be the first step. Much of this work has already been done (via RFM values for each customer) but I think a different approach might be to just throw out the RFM and look directly at the numbers. On the other hand, it might be enlightening to analyze the RFM values.

I'll have to do some thinking about how best to proceed; I want to find information that will be useful.

Wednesday, May 13, 2009

Today's Productivity

I feel that today was productive. I took a bit of a break from exploring my database and decided to learn how to export a database so it could be analyzed. Fortunately, phpPg has a handy little export button, because I haven't figured out how to do that from the command line yet. I joined two tables, one containing records of customer's first purchases, and joined it to a table containing records of a customer's last purchase, joining the two by the customer's id number. I instructed my querry to take 100 rows at random. I then took these 100 rows and downloaded them as a comma-seperated file, which I opened with Excel. I tried just throwing that file into Weka, but Weka didn't like it very much. I then went back to Excel and used some macros to clean the data. After that, I put it back into Weka, with much better results.

I still need to learn how to clean data without putting it into Excel, but for now, I'm just happy that I could get data from a database and into an algorithm.

Tuesday, May 12, 2009

Diving into Databases

Alot has happened in the last couple of days. I've

Figured out a bit more about using Putty to navigate a PostGres database
Gotten a login to the server for phpPgAdmin so I don't always have to use Putty
Learned a little bit about how Php scripting works
Been introduced to Eclipse, but currently still don't know any Java to program in it

All in all, things are going well. I don't have a data mining objective yet, so I've been analyzing and exploring the database(s) to familiarize myself with the different fields. I really wish it had come with some documentation; I can tell what most of the fields represent, but others have titles that are entirely meaningless to me. There's also several tables that I can tell are significant, but have no way of knowing exactly what they represent. (Like, the difference between direct_lastorder and direct_lastorder_hold)

I'm currently compiling a list organized by table of individual fields that look promising. I'm particularly focusing on fields that could correlate customer characteristics and purchasing habits. I'm also thinking that it might be interesting to mine data on individual products and find correlations between customers and the products they tend to buy, and through what channels.

All of my current objectives will require learning alot more about database access. I'll also have to figure out how to create new fields that are built by comparing several others. I still have a huge learning curve ahead of me, but I can at least see the sequence of steps I'm going to have to go through to get there.