Database Development
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
 
User Name:
Password:
Remember me
 
Go Back   Dev Articles Community ForumsDatabasesDatabase Development

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Display Modes
 
Unread Dev Articles Community Forums Sponsor:
Stay one step ahead of the competition. Evaluate and give feedback on some of the hottest web development tools on the market today. Make your opinion heard! Click Here
  #1  
Old March 4th, 2004, 01:57 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
DataBase Normalization

Yeah, I know, groan... But I have a question I would like people's opinion on. Rather than use one of the examples everyone always uses to illustrate normalization, I am going to use a real world example. Now, I'm building a database to hold all the major league baseball statistics. In order to avoid confusion where 2 players have the same name, I will give every player a unique id number, just an integer from 1 to however many, no relation to anything in the real world. Now I have a choice. I can store every player in a list called active_players_2004, for example, and use a team name field to tell me what team he is on. This way I could pull out those player who played for the Cubs, let's say. But, of course, this would result in the value "Chicago Cubs" being saved multiple times in the table. Let's also pretend that giving each team a number and storing the number instead of the name because it is a "smaller" data type is dumb, how else can we fix this problem? We could, instead, have a number of tables, one for each team, for example, 2004_Chicago_Cubs, and store each player's stats, keyed by player_id, under the appropriate table. This would be great for the instance where I want to return an entire team, but suppose I want to retrieve Sammy Sosa's stats, and I don't know he plays for the Cubs? Now I have to search every team table until I find his player_id. And, I can't just stop there, because he may have been traded in mid-season and have stats with another team. So I will have to search each team table to find all occurences of his player_id, to get all of his stats for that year.

I suppose I could have a table for each team, containing just the player_ids AND an active_players_2004 table containing all the stats keyed by player_id, but that just seems like an excuse for more tables.

Anybody have any thoughts about this?

Reply With Quote
  #2  
Old March 4th, 2004, 07:56 AM
dhouston's Avatar
dhouston dhouston is offline
Contributing User
Dev Articles Beginner (1000 - 1499 posts)
 
Join Date: May 2003
Location: Tennessee
Posts: 1,355 dhouston User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 7
Send a message via ICQ to dhouston
The best design is to have a player table and a team table. Each player row contains a key corresponding to the unique id for the player's current team.

It's not that storing a key rather than a text string is better because it's small but that it's more efficient. Say the team name changes, for example. You update in one place rather than 26 and there's no room for data-entry error ("Cubs" vs. "cubs" for example).

Using normalized tables also solves a problem of saving player histories. If you had a table per team per year, you'd have a db management nightmare pretty quickly. But add a history table that stores a player id, a team id, and a signing/trade date and you've got an efficient way of tracking a player's career path. (Players with a record in this table for the current year are considered active, eliminating that extra table.) Then you have a stats table that holds player_id, (current) team_id, date, stat_type, and stat, allowing you very easily to track a player's stats over time, across teams, or for a given team.
__________________
Please don't PM me asking for solutions outside the scope of a thread.
Keeping all responses in a thread stands to help others who come along later,
which is after all what this forum's all about.

Reply With Quote
  #3  
Old March 4th, 2004, 09:12 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
OK, actually you have sort of pointed me the way to go, I think. So an active_player table (this is still needed since I stored Biographical info here, things that don't change much, height, weight, DOB, etc.) and a Teams table, with the team_id being a foreign key in the active_players table?

Do you really think a per year table would become a nightmare? I ask because I intended to have a batting_stats table, a pitching_stats table, and a fielding_stats table for each year. The reason for breaking the stats into 3 seperate tables is because a DH wouldn't have fielding or pitching stats, and some pitchers won't have batting stats. I also envisioned the team id as a foreign key here, because, if a guy gets traded in mid season, he could just have 2 stat lines, one for each team. You could add them together to get his overall stats for the year, but usually (on his player card for example) they are broken out seperately. One of the main reasons to do a per year with the stats (and, man, there are a lot of them....) is that I could see only keeping say the last 10 years in the DB, and pulling the older ones off into other storage (tape or CD or something).

Thanks for the help on my earlier question. I appreciate the time...

Reply With Quote
  #4  
Old March 4th, 2004, 12:05 PM
dhouston's Avatar
dhouston dhouston is offline
Contributing User
Dev Articles Beginner (1000 - 1499 posts)
 
Join Date: May 2003
Location: Tennessee
Posts: 1,355 dhouston User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 7
Send a message via ICQ to dhouston
There's still no need for an active_player table (if you're thinking of it as being in addition to a players table). You store all the height/weight/etc. info in the players table. I don't see that you gain anything by splitting the info out when you've already got a perfectly good table for holding this info; and on top of that, every time a player goes inactive, you have to move his record. If you let the history/stats tables dictate who's active (if you don't have stats in 2004, you're not active in 2004). This paragraph only applies if you were thinking in terms of having a general players table and a separate active_players table; this would just be more work.

I do think that per-year tables are a very very bad idea, especially if you go down the path of having multiple year-specific tables. You wind up having to do klugey selects to get cross-year aggregate data and you wind up with a bazillion tables, which is just untidy, especially if there's no other advantage gained.

My approach would be roughly as follows:

Code:
CREATE TABLE stat_types(
    id int(10) unsigned zerofill NOT NULL primary key auto_increment,
    label varchar(20), -- "Pitching", "Batting", "Fielding"
    active enum('Y','N')
);

CREATE TABLE stats(
    id int(10) unsigned zerofill NOT NULL primary key auto_increment,
    player_id int(10) unsigned zerofill,
    team_id int(10) unsigned zerofill, -- your inclination to have a team id in this table was correct
    stat_type int(10) unsigned zerofill, -- key to stat_types
    stat int(5), -- or float or whatever's appropriate
    year int(4),
    date_entered datetime
);


This allows you to differentiate among types of stats, to track a given stat against player or team (making it easy to get a player's aggregate stats or a *team's* aggregate stats over a given period), and to have only one concise stats table to maintain. It also makes pulling the aggregate data out much easier than pulling out of multiple tables:

Code:
SELECT AVG(stat) FROM stats, players WHERE stats.user_id=players.id AND stat_type=1 AND (stats.year > 1996 AND stats.year < 2000)


To get the same set of stats for a team rather than for a player, you just join stats.team_id to team.id instead of stats.user_id to players.id and you're done in one simple query with no messy table jockeying.

My example doesn't accommodate subtypes of stats, I guess (ie, strikeouts vs. hits or balls vs strikes within a main stat type), but that's easily enough done by adding another lookup table and a stat_subtype field in the stats table that's keyed to that lookup table. This would change your query only in that you'd have to add another restriction to look for subtype in addition to type.

As for archiving old stats, that's no problem here either. Just do a data dump and then delete from stats where year < 1994 and you're done without having to delete tables or modify code to keep it from trying to hit the deleted tables.

Reply With Quote
  #5  
Old March 5th, 2004, 12:36 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Thanks very much, DHouston. You have convinced me not to use the per year tables idea. I am still thinking of using seperate tables for the batting stats, the pitching stats, and the fielding stats. I understand why you suggest a seperate stat subtypes table, that way if I add a new stat, it would be easy to put it into the DB. The main reason I'm thinking of keeping the stats seperate is that I have concerns about performance. Just to illustrate, there are 36 subtypes of batting stats, 36 subtypes of pitching stats, and 10 subtypes of fielding stats. There are 16 teams in the National League, each team carries 12 pitchers(at least until roster expansion in Sept), each pitcher would have 82 seperate stat lines per year. That's 12x16x82=15,744 stat lines for the National League pitchers. Figure about the same for the American League pitchers (since we could count the DH and pitcher as the same person). Also, figure about 40,000 lines for the batters, as a conservative estimate (remember, we have to count everyone who had a single at bat), and that's 72K lines per year. I am also tracking Triple A stats, so double that to 144K lines per year. That is almost 1 and a half million lines for a ten year period. Wouldn't I pay a pretty hefty performance penalty for searching for 1 person's stats?

Another reason for keeping the stats seperate is that you never look at a combination of the stats. You always view either batting stats, pitching stats, or fielding stats.

The final reason for seperating them is that I have a client side app for entering the data, and if I add a stat, I am completely screwed already, in that I have to rewrite the client side app. rewriting the DB and the server side stuff would be the easy part...I suppose I could have a stats subtype table AND 3 seperate tables for stats, hoping that no one stat table would become more the half a million lines. I wonder how much of a performance penalty I would pay for a search then?

Anyway, I really appreciate the time you have already spent helping me. I would really like to hear your thoughts on the performance issues given the numbers I postulated, if you have time.

Thanks again,

Reply With Quote
  #6  
Old March 5th, 2004, 08:00 AM
dhouston's Avatar
dhouston dhouston is offline
Contributing User
Dev Articles Beginner (1000 - 1499 posts)
 
Join Date: May 2003
Location: Tennessee
Posts: 1,355 dhouston User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 7
Send a message via ICQ to dhouston
I'm not sure performance would really be an issue if you index your tables correctly, but it sounds like you've got other compelling reasons to keep stat types separate, so I say keep 'em separate. It's really a judgment call you have to make based on your hardware, your client-side app restrictions, etc. Glad I could help, and I hope my advice turns out to have been good advice.

Reply With Quote
  #7  
Old March 8th, 2004, 04:23 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Yes, thank alot DHouston. You have been a great help. Sorry I took so long to get back and say that. I had to rush outta town cause a friend had surgery.

Thanks again,

Reply With Quote
Reply

Viewing: Dev Articles Community ForumsDatabasesDatabase Development > DataBase Normalization


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

 Free IT White Papers!
 
Accelerating Trading Partner Performance
One in five. That's how many partner transactions have at least one error. That is an amazing statistic, particularly given the extraordinary leaps in innovation across the global supply chain during the past two decades. Download this white paper to learn more.

 
Competing on Analytics
This Tech Analysis is designed to help identify characteristics shared by analytics competitors, and includes information about 32 organizations that have made a commitment to quantitative, fact-based analysis.

 
Cost Effective Scaling with Virtualization and Coyote Point Systems
An overview of the industry trend toward virtualization, how server consolidation has increased the importance of application uptime and the steps being taken to integrate load balancing technology with virtualized servers.

 
Five Checkpoints to Implementing IP Telephony
Implementation planning for IP PBX software and IP telephony has become vital as businesses replace discontinued legacy PBX phone systems. This informative whitepaper outlines five &quot;checkpoints&quot; for any implementation plan that will help make IP communications a successful proposition.

 
Hosted Email Security: Staying Ahead of New Threats
In the last two years, email has become a fierce battleground between the nefarious forces of spam and malware, and the heroes of messaging protection. The spam volumes increased alarmingly every month, bringing clever new forms of phishing and virus propagation attacks.

 

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 6 hosted by Hostway