Weston Rowley's Data Mining Log: May 2009

Wednesday, May 27, 2009

It's 20 May now, which means that it's been a week since I last posted. That's pretty negligent on my part, so I figure it's time to post again.

Actually, the truth of the matter is that I just got help in figuring out how to do a particular PostgreSQL command. That was probably about 10-15 minutes ago. It's still running, so I decided to take some time to update my blog. I really hope it doesn't crash; I really have no way to tell if its still running. In all reality, it could have crashed long ago and I wouldn't know. On the other hand, it could be running fine, and if I cancel it to retry, it could ruin all the progress that it has so far. I wish I knew why my server querries have been taking so long; the time that they take seems to get longer every day. I just lost and reestablished connection to the server, so I should probably cancel and retry the command.

I'm still trying to build a table with information for all the individual customers - cutting out wholesalers, coorperations, government entities, churches, exc. The first problem I've run into is that customers are often mislabeled. I have to look not only at their class, but also at the profile class. Some of these seem to be significant, and others seem to just be randomly assigned. I've done some filtering that I think has identified viable customers, but I really can't be completely sure without sifting through the 300000 odd customers. I don't plan on doing that.

Now that I've identified customers, I'm trying to put together information for them. In particular, I'm trying to have the computer count the number of purchases they've made through various channels, then sum the values of those purchases. That wouldn't be too difficult, except that it takes so long to run each one of the queries that I pretty much have to get it right the first time I do it. The computer is still processing the first of those querries - counting the retail purchases each customer has made. I hope it's done soon so I can move on to another part of the table.

Wednesday, May 20, 2009

Postgres holds one less secret from me

Today was a slow but productive day. I spent a good portion of it figuring out how to export a table from a postgres database, and I think I was altogether successful. I pulled a few columns off the orders table, which I hope will help me in creating a random sample size of customers. The difficulty is, I need to find totals on what each customer ordered, so I needed all the rows in the orders table. I selected these rows on the condition that they're tied to a customer who is labeled as a Consumer, so that likely cut out alot of unnecicery rows. The problem is, I still have millions of rows, so I'm having trouble finding a progrem that can handle all that data. I could probably write a script that would handle it for me, but I don't know enough progreming to be able to do that. Instead, I'm going to try to clean the data with microsoft access, using their embedded VBA to look at all the data. Should be an adventure, I think.

I just got Access to open the orders and customer sample databases. Next on my list of things to do is figure out how VBA in access differs from VBA in Excell. I'll be pretty happy when I get most of the data on the customer sheet, because then I can open it in excel and just work from there. I'm sure I'll do fine with Access, though.

Friday, May 15, 2009

Figuring out what I'm after

I've reviewed the fields on the numerous tables in the database I'm studying, and I'm pretty sure I know what most of them mean. I guess the next step in the process is to figure out what I'm trying to find out in the data, so I can construct additional data and clean existing data accordingly. I think that it might be useful to create a method for identifying valuable customers. I might do this by creating an ordered hierarchy of customers based on how much they spent in a given year (or the average over several years) and assigning each customer a percentile. Then, by looking at the total amount spent by each percentile (likely banded to 2 - 5 percentiles per band) I could identify what percent of the profits come from what percent of the customers. (For instance, everyone at or above the 75th percentile contributed more then 60% of the total profits.)

I'm still not sure how to attack the issue of customer's spending channels - typifying what kind of customers spend how much through which avenues. I think that identifying "valuable" customers may be the first step. Much of this work has already been done (via RFM values for each customer) but I think a different approach might be to just throw out the RFM and look directly at the numbers. On the other hand, it might be enlightening to analyze the RFM values.

I'll have to do some thinking about how best to proceed; I want to find information that will be useful.

Wednesday, May 13, 2009

Today's Productivity

I feel that today was productive. I took a bit of a break from exploring my database and decided to learn how to export a database so it could be analyzed. Fortunately, phpPg has a handy little export button, because I haven't figured out how to do that from the command line yet. I joined two tables, one containing records of customer's first purchases, and joined it to a table containing records of a customer's last purchase, joining the two by the customer's id number. I instructed my querry to take 100 rows at random. I then took these 100 rows and downloaded them as a comma-seperated file, which I opened with Excel. I tried just throwing that file into Weka, but Weka didn't like it very much. I then went back to Excel and used some macros to clean the data. After that, I put it back into Weka, with much better results.

I still need to learn how to clean data without putting it into Excel, but for now, I'm just happy that I could get data from a database and into an algorithm.

Tuesday, May 12, 2009

Diving into Databases

Alot has happened in the last couple of days. I've

Figured out a bit more about using Putty to navigate a PostGres database
Gotten a login to the server for phpPgAdmin so I don't always have to use Putty
Learned a little bit about how Php scripting works
Been introduced to Eclipse, but currently still don't know any Java to program in it

All in all, things are going well. I don't have a data mining objective yet, so I've been analyzing and exploring the database(s) to familiarize myself with the different fields. I really wish it had come with some documentation; I can tell what most of the fields represent, but others have titles that are entirely meaningless to me. There's also several tables that I can tell are significant, but have no way of knowing exactly what they represent. (Like, the difference between direct_lastorder and direct_lastorder_hold)

I'm currently compiling a list organized by table of individual fields that look promising. I'm particularly focusing on fields that could correlate customer characteristics and purchasing habits. I'm also thinking that it might be interesting to mine data on individual products and find correlations between customers and the products they tend to buy, and through what channels.

All of my current objectives will require learning alot more about database access. I'll also have to figure out how to create new fields that are built by comparing several others. I still have a huge learning curve ahead of me, but I can at least see the sequence of steps I'm going to have to go through to get there.

Wednesday, May 6, 2009

I've slacked on my posts these last few days

Quite a bit has happened in the last couple of days, which is part of the reason why I haven't been updating often enough. I've completed all the reading on methods of Data Mining, basic algorithms, and ways to attack the problem. That was all very interesting, but I was more then ready to be done with the reading and on to the application.

I decided to start learning about how to use Weka, an open-source program that contains many algorithms for data mining, as well as several GUIs for running and comparing them. I still wish that I knew more about the different algorithms (when and how to apply them, what kind of input they like, exc.) but I'll have to figure that out as I go along. I did some extensive experementation in Weka with several small databases that it comes with, which has allowed me to gain alot of familiarity with the ways to crunch and interpret the data.

Mostly I'm waiting to get my login to the server so I can start mining Deseret Book's customer database. I don't know yet what my objective is in doing that, but I think it will likely take some time to familiarize myself with the tools we use to extract information from the server, clean the data, and put it into at minable database. However, I am looking foreward to getting some mining goals so I can know what I'm looking for. Reading about how these programs work is great, but I'm getting tired of just reading.

Friday, May 1, 2009

More fun with learning

Yesterday and today i did some reading on processes used in assessing, attacking, and solving problems with Data Mining. One thing of particular interest that I just read is an overview of the way that Neural Networks work. I had already started to wonder if Data Mining could be implemented in Chemical Plants as a part of process control, and I just read that one of the best tools for creating predictive models of chemical plants is through a Neural Network. This makes sense, as literally thousands of variables, some which are outside of the plant operator's control, must be taken into account when optimizing a chemical process. Even though a neural network perhaps cannot understand the scientific reasoning behind why the correlations it is able to draw are valid, this does not seem to matter so far as the overall model is accurate. I am finding the idea of data mining as it applies to predictive modeling (I'm guessing that that is also related to machine learning) to be very interesting.

Weston Rowley's Data Mining Log