|
|
|||||||||
|
|||||||||
|
|||||||||
| |
|||
| ||||||||||||||||||||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Display Modes |
|
#1
|
|||
|
|||
|
DataBase Normalization
Yeah, I know, groan... But I have a question I would like people's opinion on. Rather than use one of the examples everyone always uses to illustrate normalization, I am going to use a real world example. Now, I'm building a database to hold all the major league baseball statistics. In order to avoid confusion where 2 players have the same name, I will give every player a unique id number, just an integer from 1 to however many, no relation to anything in the real world. Now I have a choice. I can store every player in a list called active_players_2004, for example, and use a team name field to tell me what team he is on. This way I could pull out those player who played for the Cubs, let's say. But, of course, this would result in the value "Chicago Cubs" being saved multiple times in the table. Let's also pretend that giving each team a number and storing the number instead of the name because it is a "smaller" data type is dumb, how else can we fix this problem? We could, instead, have a number of tables, one for each team, for example, 2004_Chicago_Cubs, and store each player's stats, keyed by player_id, under the appropriate table. This would be great for the instance where I want to return an entire team, but suppose I want to retrieve Sammy Sosa's stats, and I don't know he plays for the Cubs? Now I have to search every team table until I find his player_id. And, I can't just stop there, because he may have been traded in mid-season and have stats with another team. So I will have to search each team table to find all occurences of his player_id, to get all of his stats for that year.
I suppose I could have a table for each team, containing just the player_ids AND an active_players_2004 table containing all the stats keyed by player_id, but that just seems like an excuse for more tables. Anybody have any thoughts about this? |
|
#2
|
||||
|
||||
|
The best design is to have a player table and a team table. Each player row contains a key corresponding to the unique id for the player's current team.
It's not that storing a key rather than a text string is better because it's small but that it's more efficient. Say the team name changes, for example. You update in one place rather than 26 and there's no room for data-entry error ("Cubs" vs. "cubs" for example). Using normalized tables also solves a problem of saving player histories. If you had a table per team per year, you'd have a db management nightmare pretty quickly. But add a history table that stores a player id, a team id, and a signing/trade date and you've got an efficient way of tracking a player's career path. (Players with a record in this table for the current year are considered active, eliminating that extra table.) Then you have a stats table that holds player_id, (current) team_id, date, stat_type, and stat, allowing you very easily to track a player's stats over time, across teams, or for a given team.
__________________
Please don't PM me asking for solutions outside the scope of a thread. Keeping all responses in a thread stands to help others who come along later, which is after all what this forum's all about. |
|
#3
|
|||
|
|||
|
OK, actually you have sort of pointed me the way to go, I think. So an active_player table (this is still needed since I stored Biographical info here, things that don't change much, height, weight, DOB, etc.) and a Teams table, with the team_id being a foreign key in the active_players table?
Do you really think a per year table would become a nightmare? I ask because I intended to have a batting_stats table, a pitching_stats table, and a fielding_stats table for each year. The reason for breaking the stats into 3 seperate tables is because a DH wouldn't have fielding or pitching stats, and some pitchers won't have batting stats. I also envisioned the team id as a foreign key here, because, if a guy gets traded in mid season, he could just have 2 stat lines, one for each team. You could add them together to get his overall stats for the year, but usually (on his player card for example) they are broken out seperately. One of the main reasons to do a per year with the stats (and, man, there are a lot of them....) is that I could see only keeping say the last 10 years in the DB, and pulling the older ones off into other storage (tape or CD or something). Thanks for the help on my earlier question. I appreciate the time... |
|
#4
|
||||
|
||||
|
There's still no need for an active_player table (if you're thinking of it as being in addition to a players table). You store all the height/weight/etc. info in the players table. I don't see that you gain anything by splitting the info out when you've already got a perfectly good table for holding this info; and on top of that, every time a player goes inactive, you have to move his record. If you let the history/stats tables dictate who's active (if you don't have stats in 2004, you're not active in 2004). This paragraph only applies if you were thinking in terms of having a general players table and a separate active_players table; this would just be more work.
I do think that per-year tables are a very very bad idea, especially if you go down the path of having multiple year-specific tables. You wind up having to do klugey selects to get cross-year aggregate data and you wind up with a bazillion tables, which is just untidy, especially if there's no other advantage gained. My approach would be roughly as follows: Code:
CREATE TABLE stat_types(
id int(10) unsigned zerofill NOT NULL primary key auto_increment,
label varchar(20), -- "Pitching", "Batting", "Fielding"
active enum('Y','N')
);
CREATE TABLE stats(
id int(10) unsigned zerofill NOT NULL primary key auto_increment,
player_id int(10) unsigned zerofill,
team_id int(10) unsigned zerofill, -- your inclination to have a team id in this table was correct
stat_type int(10) unsigned zerofill, -- key to stat_types
stat int(5), -- or float or whatever's appropriate
year int(4),
date_entered datetime
);
This allows you to differentiate among types of stats, to track a given stat against player or team (making it easy to get a player's aggregate stats or a *team's* aggregate stats over a given period), and to have only one concise stats table to maintain. It also makes pulling the aggregate data out much easier than pulling out of multiple tables: Code:
SELECT AVG(stat) FROM stats, players WHERE stats.user_id=players.id AND stat_type=1 AND (stats.year > 1996 AND stats.year < 2000) To get the same set of stats for a team rather than for a player, you just join stats.team_id to team.id instead of stats.user_id to players.id and you're done in one simple query with no messy table jockeying. My example doesn't accommodate subtypes of stats, I guess (ie, strikeouts vs. hits or balls vs strikes within a main stat type), but that's easily enough done by adding another lookup table and a stat_subtype field in the stats table that's keyed to that lookup table. This would change your query only in that you'd have to add another restriction to look for subtype in addition to type. As for archiving old stats, that's no problem here either. Just do a data dump and then delete from stats where year < 1994 and you're done without having to delete tables or modify code to keep it from trying to hit the deleted tables. |
|
#5
|
|||
|
|||
|
Thanks very much, DHouston. You have convinced me not to use the per year tables idea. I am still thinking of using seperate tables for the batting stats, the pitching stats, and the fielding stats. I understand why you suggest a seperate stat subtypes table, that way if I add a new stat, it would be easy to put it into the DB. The main reason I'm thinking of keeping the stats seperate is that I have concerns about performance. Just to illustrate, there are 36 subtypes of batting stats, 36 subtypes of pitching stats, and 10 subtypes of fielding stats. There are 16 teams in the National League, each team carries 12 pitchers(at least until roster expansion in Sept), each pitcher would have 82 seperate stat lines per year. That's 12x16x82=15,744 stat lines for the National League pitchers. Figure about the same for the American League pitchers (since we could count the DH and pitcher as the same person). Also, figure about 40,000 lines for the batters, as a conservative estimate (remember, we have to count everyone who had a single at bat), and that's 72K lines per year. I am also tracking Triple A stats, so double that to 144K lines per year. That is almost 1 and a half million lines for a ten year period. Wouldn't I pay a pretty hefty performance penalty for searching for 1 person's stats?
Another reason for keeping the stats seperate is that you never look at a combination of the stats. You always view either batting stats, pitching stats, or fielding stats. The final reason for seperating them is that I have a client side app for entering the data, and if I add a stat, I am completely screwed already, in that I have to rewrite the client side app. rewriting the DB and the server side stuff would be the easy part...I suppose I could have a stats subtype table AND 3 seperate tables for stats, hoping that no one stat table would become more the half a million lines. I wonder how much of a performance penalty I would pay for a search then? Anyway, I really appreciate the time you have already spent helping me. I would really like to hear your thoughts on the performance issues given the numbers I postulated, if you have time. Thanks again, |
|
#6
|
||||
|
||||
|
I'm not sure performance would really be an issue if you index your tables correctly, but it sounds like you've got other compelling reasons to keep stat types separate, so I say keep 'em separate. It's really a judgment call you have to make based on your hardware, your client-side app restrictions, etc. Glad I could help, and I hope my advice turns out to have been good advice.
![]() |
|
#7
|
|||
|
|||
|
Yes, thank alot DHouston. You have been a great help. Sorry I took so long to get back and say that. I had to rush outta town cause a friend had surgery.
Thanks again, |
![]() |
| Viewing: Dev Articles Community Forums > Databases > Database Development > DataBase Normalization |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|