Database Development
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
 
User Name:
Password:
Remember me
 
Go Back   Dev Articles Community ForumsDatabasesDatabase Development

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Display Modes
 
Unread Dev Articles Community Forums Sponsor:
  #1  
Old March 4th, 2004, 01:57 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
DataBase Normalization

Yeah, I know, groan... But I have a question I would like people's opinion on. Rather than use one of the examples everyone always uses to illustrate normalization, I am going to use a real world example. Now, I'm building a database to hold all the major league baseball statistics. In order to avoid confusion where 2 players have the same name, I will give every player a unique id number, just an integer from 1 to however many, no relation to anything in the real world. Now I have a choice. I can store every player in a list called active_players_2004, for example, and use a team name field to tell me what team he is on. This way I could pull out those player who played for the Cubs, let's say. But, of course, this would result in the value "Chicago Cubs" being saved multiple times in the table. Let's also pretend that giving each team a number and storing the number instead of the name because it is a "smaller" data type is dumb, how else can we fix this problem? We could, instead, have a number of tables, one for each team, for example, 2004_Chicago_Cubs, and store each player's stats, keyed by player_id, under the appropriate table. This would be great for the instance where I want to return an entire team, but suppose I want to retrieve Sammy Sosa's stats, and I don't know he plays for the Cubs? Now I have to search every team table until I find his player_id. And, I can't just stop there, because he may have been traded in mid-season and have stats with another team. So I will have to search each team table to find all occurences of his player_id, to get all of his stats for that year.

I suppose I could have a table for each team, containing just the player_ids AND an active_players_2004 table containing all the stats keyed by player_id, but that just seems like an excuse for more tables.

Anybody have any thoughts about this?

Reply With Quote
  #2  
Old March 4th, 2004, 07:56 AM
dhouston's Avatar
dhouston dhouston is offline
Contributing User
Dev Articles Beginner (1000 - 1499 posts)
 
Join Date: May 2003
Location: Tennessee
Posts: 1,355 dhouston User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 8
Send a message via ICQ to dhouston
The best design is to have a player table and a team table. Each player row contains a key corresponding to the unique id for the player's current team.

It's not that storing a key rather than a text string is better because it's small but that it's more efficient. Say the team name changes, for example. You update in one place rather than 26 and there's no room for data-entry error ("Cubs" vs. "cubs" for example).

Using normalized tables also solves a problem of saving player histories. If you had a table per team per year, you'd have a db management nightmare pretty quickly. But add a history table that stores a player id, a team id, and a signing/trade date and you've got an efficient way of tracking a player's career path. (Players with a record in this table for the current year are considered active, eliminating that extra table.) Then you have a stats table that holds player_id, (current) team_id, date, stat_type, and stat, allowing you very easily to track a player's stats over time, across teams, or for a given team.
__________________
Please don't PM me asking for solutions outside the scope of a thread.
Keeping all responses in a thread stands to help others who come along later,
which is after all what this forum's all about.

Reply With Quote
  #3  
Old March 4th, 2004, 09:12 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
OK, actually you have sort of pointed me the way to go, I think. So an active_player table (this is still needed since I stored Biographical info here, things that don't change much, height, weight, DOB, etc.) and a Teams table, with the team_id being a foreign key in the active_players table?

Do you really think a per year table would become a nightmare? I ask because I intended to have a batting_stats table, a pitching_stats table, and a fielding_stats table for each year. The reason for breaking the stats into 3 seperate tables is because a DH wouldn't have fielding or pitching stats, and some pitchers won't have batting stats. I also envisioned the team id as a foreign key here, because, if a guy gets traded in mid season, he could just have 2 stat lines, one for each team. You could add them together to get his overall stats for the year, but usually (on his player card for example) they are broken out seperately. One of the main reasons to do a per year with the stats (and, man, there are a lot of them....) is that I could see only keeping say the last 10 years in the DB, and pulling the older ones off into other storage (tape or CD or something).

Thanks for the help on my earlier question. I appreciate the time...

Reply With Quote
  #4  
Old March 4th, 2004, 12:05 PM
dhouston's Avatar
dhouston dhouston is offline
Contributing User
Dev Articles Beginner (1000 - 1499 posts)
 
Join Date: May 2003
Location: Tennessee
Posts: 1,355 dhouston User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 8
Send a message via ICQ to dhouston
There's still no need for an active_player table (if you're thinking of it as being in addition to a players table). You store all the height/weight/etc. info in the players table. I don't see that you gain anything by splitting the info out when you've already got a perfectly good table for holding this info; and on top of that, every time a player goes inactive, you have to move his record. If you let the history/stats tables dictate who's active (if you don't have stats in 2004, you're not active in 2004). This paragraph only applies if you were thinking in terms of having a general players table and a separate active_players table; this would just be more work.

I do think that per-year tables are a very very bad idea, especially if you go down the path of having multiple year-specific tables. You wind up having to do klugey selects to get cross-year aggregate data and you wind up with a bazillion tables, which is just untidy, especially if there's no other advantage gained.

My approach would be roughly as follows:

Code:
CREATE TABLE stat_types(
    id int(10) unsigned zerofill NOT NULL primary key auto_increment,
    label varchar(20), -- "Pitching", "Batting", "Fielding"
    active enum('Y','N')
);

CREATE TABLE stats(
    id int(10) unsigned zerofill NOT NULL primary key auto_increment,
    player_id int(10) unsigned zerofill,
    team_id int(10) unsigned zerofill, -- your inclination to have a team id in this table was correct
    stat_type int(10) unsigned zerofill, -- key to stat_types
    stat int(5), -- or float or whatever's appropriate
    year int(4),
    date_entered datetime
);


This allows you to differentiate among types of stats, to track a given stat against player or team (making it easy to get a player's aggregate stats or a *team's* aggregate stats over a given period), and to have only one concise stats table to maintain. It also makes pulling the aggregate data out much easier than pulling out of multiple tables:

Code:
SELECT AVG(stat) FROM stats, players WHERE stats.user_id=players.id AND stat_type=1 AND (stats.year > 1996 AND stats.year < 2000)


To get the same set of stats for a team rather than for a player, you just join stats.team_id to team.id instead of stats.user_id to players.id and you're done in one simple query with no messy table jockeying.

My example doesn't accommodate subtypes of stats, I guess (ie, strikeouts vs. hits or balls vs strikes within a main stat type), but that's easily enough done by adding another lookup table and a stat_subtype field in the stats table that's keyed to that lookup table. This would change your query only in that you'd have to add another restriction to look for subtype in addition to type.

As for archiving old stats, that's no problem here either. Just do a data dump and then delete from stats where year < 1994 and you're done without having to delete tables or modify code to keep it from trying to hit the deleted tables.

Reply With Quote
  #5  
Old March 5th, 2004, 12:36 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Thanks very much, DHouston. You have convinced me not to use the per year tables idea. I am still thinking of using seperate tables for the batting stats, the pitching stats, and the fielding stats. I understand why you suggest a seperate stat subtypes table, that way if I add a new stat, it would be easy to put it into the DB. The main reason I'm thinking of keeping the stats seperate is that I have concerns about performance. Just to illustrate, there are 36 subtypes of batting stats, 36 subtypes of pitching stats, and 10 subtypes of fielding stats. There are 16 teams in the National League, each team carries 12 pitchers(at least until roster expansion in Sept), each pitcher would have 82 seperate stat lines per year. That's 12x16x82=15,744 stat lines for the National League pitchers. Figure about the same for the American League pitchers (since we could count the DH and pitcher as the same person). Also, figure about 40,000 lines for the batters, as a conservative estimate (remember, we have to count everyone who had a single at bat), and that's 72K lines per year. I am also tracking Triple A stats, so double that to 144K lines per year. That is almost 1 and a half million lines for a ten year period. Wouldn't I pay a pretty hefty performance penalty for searching for 1 person's stats?

Another reason for keeping the stats seperate is that you never look at a combination of the stats. You always view either batting stats, pitching stats, or fielding stats.

The final reason for seperating them is that I have a client side app for entering the data, and if I add a stat, I am completely screwed already, in that I have to rewrite the client side app. rewriting the DB and the server side stuff would be the easy part...I suppose I could have a stats subtype table AND 3 seperate tables for stats, hoping that no one stat table would become more the half a million lines. I wonder how much of a performance penalty I would pay for a search then?

Anyway, I really appreciate the time you have already spent helping me. I would really like to hear your thoughts on the performance issues given the numbers I postulated, if you have time.

Thanks again,

Reply With Quote
  #6  
Old March 5th, 2004, 08:00 AM
dhouston's Avatar
dhouston dhouston is offline
Contributing User
Dev Articles Beginner (1000 - 1499 posts)
 
Join Date: May 2003
Location: Tennessee
Posts: 1,355 dhouston User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 8
Send a message via ICQ to dhouston
I'm not sure performance would really be an issue if you index your tables correctly, but it sounds like you've got other compelling reasons to keep stat types separate, so I say keep 'em separate. It's really a judgment call you have to make based on your hardware, your client-side app restrictions, etc. Glad I could help, and I hope my advice turns out to have been good advice.

Reply With Quote
  #7  
Old March 8th, 2004, 04:23 AM
Buster77 Buster77 is offline
Registered User
Dev Articles Newbie (0 - 499 posts)
 
Join Date: Feb 2004
Posts: 28 Buster77 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Yes, thank alot DHouston. You have been a great help. Sorry I took so long to get back and say that. I had to rush outta town cause a friend had surgery.

Thanks again,

Reply With Quote
Reply

Viewing: Dev Articles Community ForumsDatabasesDatabase Development > DataBase Normalization


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump




 Free IT White Papers!
 
Create the Optimal Architecture for your Critical Applications
Warburton's the largest independently owned bakery in the UK faced a number of difficult challenges in providing the most robust yet efficient IT infrastructure for their organization's success. IBM's services combined with their xSeries servers created the perfect platform for their SAP environment with sufficient flexibility, and did so in very time effective fashion.

Request Your Free Technology Downloads!
 
Five Best Practices for Deploying a Successful Service-Oriented Architecture
This white paper describes the benefits you can expect with SOA, and how IBM can help take your business there.

Request Your Free Technology Downloads!
 
Gartner Magic Quadrant for Application Delivery Controllers
Gartner summarizes its view on Application Delivery Controllers, evaluates strengths and weaknesses of solutions, and provides Magic Quadrant reporting for a quick comparison across all vendors. Learn from Gartner how you can benefit from an all-in-one device like Citrix NetScaler that delivers the highest levels of availability, performance and security.

Request Your Free Technology Downloads!
 
Knowledge is Power
What you don't know can hurt you, and is likely costing you money and increasing your security risks during an era of scarce resources. This white paper proposes six key strategies that enterprise security managers can use to improve their network defense posture.

Request Your Free Technology Downloads!
 
Rationalizing the Multi-Tool Environment
The rationalized multi-tool approach is flexible, scalable and cost effective. It provides the necessary input to the IT service management business processes. It preserves prior investments in monitoring tools, empowers technologists to select the best tools with which to do their jobs, and enhances effective response to incidents.

Request Your Free Technology Downloads!
 

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 




© 2003-2010 by Developer Shed. All rights reserved. DS Cluster 5 Hosted by Hostway
For more Enterprise Application Development news, visit eWeek