Book Information
· Category: Foreign Books > Science/Mathematics/Ecology > Mathematics > Probability and Statistics > General
· ISBN: 9781119549840
· Pages: 608
Table of Contents
Foreword by Gareth James xix
Foreword by Ravi Bapna xxi
Preface to the Python Edition xxiii
Acknowledgments xxvii
Part I Preliminaries
Chapter 1 Introduction 3
1.1 What Is Business Analytics? 3
1.2 What Is Data Mining? 5
1.3 Data Mining and Related Terms 5
1.4 Big Data 6
1.5 Data Science 7
1.6 Why Are There So Many Different Methods? 8
1.7 Terminology and Notation 9
1.8 Road Maps to This Book 11
Order of Topics 11
Chapter 2 Overview of the Data Mining Process 15
2.1 Introduction 15
2.2 Core Ideas in Data Mining 16
Classification 16
Prediction 16
Association Rules and Recommendation Systems 16
Predictive Analytics 17
Data Reduction and Dimension Reduction 17
Data Exploration and Visualization 17
Supervised and Unsupervised Learning 18
2.3 The Steps in Data Mining 19
2.4 Preliminary Steps 21
Organization of Datasets 21
Predicting Home Values in the West Roxbury Neighborhood 21
Loading and Looking at the Data in Python 22
Python Imports 25
Sampling from a Database 26
Oversampling Rare Events in Classification Tasks 26
Preprocessing and Cleaning the Data 27
2.5 Predictive Power and Overfitting 34
Overfitting 34
Creation and Use of Data Partitions 36
2.6 Building a Predictive Model 40
Modeling Process 40
2.7 Using Python for Data Mining on a Local Machine 45
2.8 Automating Data Mining Solutions 46
2.9 Ethical Practice in Data Mining 47
Data Mining Software: The State of the Market (by Herb Edelstein) 52
Problems 56
Part II Data Exploration and Dimension Reduction
Chapter 3 Data Visualization 61
3.1 Uses of Data Visualization 61
Python 63
3.2 Data Examples 64
Example 1: Boston Housing Data 64
Example 2: Ridership on Amtrak Trains 65
3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots 66
Distribution Plots: Boxplots and Histograms 68
Heatmaps: Visualizing Correlations and Missing Values 72
3.4 Multidimensional Visualization 75
Adding Variables: Color, Size, Shape, Multiple Panels, and Animation 75
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering 77
Reference: Trend Lines and Labels 83
Scaling up to Large Datasets 84
Multivariate Plot: Parallel Coordinates Plot 84
Interactive Visualization 86
3.5 Specialized Visualizations 89
Visualizing Networked Data 90
Visualizing Hierarchical Data: Treemaps 92
Visualizing Geographical Data: Map Charts 94
3.6 Summary: Major Visualizations and Operations, by Data Mining Goal 97
Prediction 97
Classification 97
Time Series Forecasting 97
Unsupervised Learning 98
Problems 99
Chapter 4 Dimension Reduction 101
Python 101
4.1 Introduction 102
4.2 Curse of Dimensionality 102
4.3 Practical Considerations 103
Example 1: House Prices in Boston 104
4.4 Data Summaries 104
Summary Statistics 104
Aggregation and Pivot Tables 106
4.5 Correlation Analysis 108
4.6 Reducing the Number of Categories in Categorical Variables 110
4.7 Converting a Categorical Variable to a Numerical Variable 111
4.8 Principal Components Analysis 111
Example 2: Breakfast Cereals 111
Principal Components 116
Normalizing the Data 117
Using Principal Components for Classification and Prediction 120
4.9 Dimension Reduction Using Regression Models 122
4.10 Dimension Reduction Using Classification and Regression Trees 122
Problems 123
Part III Performance Evaluation
Chapter 5 Evaluating Predictive Performance 129
Python 129
5.1 Introduction 130
5.2 Evaluating Predictive Performance 130
Naive Benchmark: The Average 131
Prediction Accuracy Measures 131
Comparing Training and Validation Performance 132
Cumulative Gains and Lift Charts 135
5.3 Judging Classifier Performance 136
Benchmark: The Naive Rule 136
Class Separation 138
The Confusion (Classification) Matrix 139
Using the Validation Data 140
Accuracy Measures 140
Propensities and Cutoff for Classification 141
Performance in Case of Unequal Importance of Classes 143
Asymmetric Misclassification Costs 147
Generalization to More Than Two Classes 149
5.4 Judging Ranking Performance 150
Gains and Lift Charts for Binary Data 150
Decile Lift Charts 153
Beyond Two Classes 154
Gains and Lift Charts Incorporating Costs and Benefits 154
Cumulative Gains as a Function of Cutoff 154
5.5 Oversampling 155
Oversampling the Training Set 158
Evaluating Model Performance Using a Non-oversampled Validation Set 158
Evaluating Model Performance if Only Oversampled Validation Set Exists 158
Problems 161
Part IV Prediction and Classification Methods
Chapter 6 Multiple Linear Regression 167
Python 167
6.1 Introduction 168
6.2 Explanatory vs. Predictive Modeling 168
6.3 Estimating the Regression Equation and Prediction 170
Example: Predicting the Price of Used Toyota Corolla Cars 171
6.4 Variable Selection in Linear Regression 176
Reducing the Number of Predictors 176
How to Reduce the Number of Predictors 177
Regularization (Shrinkage Models) 183
Problems 187
Chapter 7 k-Nearest Neighbors (kNN) 191
Python 191
7.1 The k-NN Classifier (Categorical Outcome) 192
Determining Neighbors 192
Classification Rule 193
Example: Riding Mowers 193
Choosing k 195
Setting the Cutoff Value 197
k-NN with More Than Two Classes 200
Converting Categorical Variables to Binary Dummies 200
7.2 k-NN for a Numerical Outcome 200
7.3 Advantages and Shortcomings of k-NN Algorithms 202
Problems 204
Chapter 8 The Naive Bayes Classifier 207
Python 207
8.1 Introduction 207
Cutoff Probability Method 208
Conditional Probability 208
Example 1: Predicting Fraudulent Financial Reporting 209
8.2 Applying the Full (Exact) Bayesian Classifier 210
Using the “Assign to the Most Probable Class” Method 210
Using the Cutoff Probability Method 210
Practical Difficulty with the Complete (Exact) Bayes Procedure 210
Solution: Naive Bayes 211
The Naive Bayes Assumption of Conditional Independence 212
Using the Cutoff Probability Method 213
Example 2: Predicting Fraudulent Financial Reports, Two Predictors 213
Example 3: Predicting Delayed Flights 214
8.3 Advantages and Shortcomings of the Naive Bayes Classifier 221
Problems 223
Chapter 9 Classification and Regression Trees 225
Python 225
9.1 Introduction 226
Tree Structure 226
Decision Rules 227
Classifying a New Record 228
9.2 Classification Trees 228
Recursive Partitioning 228
Example 1: Riding Mowers 229
Measures of Impurity 231
9.3 Evaluating the Performance of a Classification Tree 237
Example 2: Acceptance of Personal Loan 237
Sensitivity Analysis Using Cross Validation 239
9.4 Avoiding Overfitting 242
Stopping Tree Growth 242
Fine-tuning Tree Parameters 244
Other Methods for Limiting Tree Size 247
9.5 Classification Rules from Trees 248
9.6 Classification Trees for More Than Two Classes 249
9.7 Regression Trees 249
Prediction 252
Measuring Impurity 252
Evaluating Performance 252
9.8 Improving Prediction: Random Forests and Boosted Trees 253
Random Forests 253
Boosted Trees 255
9.9 Advantages and Weaknesses of a Tree 256
Problems 259
Chapter 10 Logistic Regression 263
Python 263
10.1 Introduction 264
10.2 The Logistic Regression Model 265
10.3 Example: Acceptance of Personal Loan 267
Model with a Single Predictor 267
Estimating the Logistic Model from Data: Computing Parameter Estimates 269
Interpreting Results in Terms of Odds (for a Profiling Goal) 272
10.4 Evaluating Classification Performance 273
Variable Selection 276
10.5 Logistic Regression for Multi-class Classification 276
Ordinal Classes 277
Nominal Classes 278
Comparing Ordinal and Nominal Models 279
10.6 Example of Complete Analysis: Predicting Delayed Flights 281
Data Preprocessing 284
Model Training 285
Model Interpretation 285
Model Performance 285
Variable Selection 288
Problems 294
Chapter 11 Neural Nets 297
Python 297
11.1 Introduction 298
11.2 Concept and Structure of a Neural Network 298
11.3 Fitting a Network to Data 299
Example 1: Tiny Dataset 299
Computing Output of Nodes 301
Preprocessing the Data 303
Training the Model 304
Example 2: Classifying Accident Severity 308
Avoiding Overfitting 311
Using the Output for Prediction and Classification 311
11.4 Required User Input 312
11.5 Exploring the Relationship Between Predictors and Outcome 313
11.6 Deep Learning 313
Convolutional Neural Networks (CNNs) 314
Local Feature Map 316
A Hierarchy of Features 316
The Learning Process 316
Unsupervised Learning 317
Conclusion 318
11.7 Advantages and Weaknesses of Neural Networks 319
Problems 321
Chapter 12 Discriminant Analysis 323
Python 323
12.1 Introduction 324
Example 1: Riding Mowers 324
Example 2: Personal Loan Acceptance 324
12.2 Distance of a Record from a Class 325
12.3 Fisher’s Linear Classification Functions 328
12.4 Classification Performance of Discriminant Analysis 331
12.5 Prior Probabilities 333
12.6 Unequal Misclassification Costs 333
12.7 Classifying More Than Two Classes 335
Example 3: Medical Dispatch to Accident Scenes 335
12.8 Advantages and Weaknesses 338
Problems 339
Chapter 13 Combining Methods: Ensembles and Uplift Modeling 343
Python 343
13.1 Ensembles 344
Why Ensembles Can Improve Predictive Power 345
Simple Averaging 346
Bagging 347
Boosting 347
Bagging and Boosting in Python 348
Advantages and Weaknesses of Ensembles 348
13.2 Uplift (Persuasion) Modeling 350
A-B Testing 350
Uplift 350
Gathering the Data 351
A Simple Model 352
Modeling Individual Uplift 353
Computing Uplift with Python 355
Using the Results of an Uplift Model 355
13.3 Summary 355
Problems 357
Part V Mining Relationships Among Records
Chapter 14 Association Rules and Collaborative Filtering 361
Python 361
14.1 Association Rules 362
Discovering Association Rules in Transaction Databases 362
Example 1: Synthetic Data on Purchases of Phone Faceplates 363
Generating Candidate Rules 363
The Apriori Algorithm 366
Selecting Strong Rules 366
Data Format 368
The Process of Rule Selection 369
Interpreting the Results 370
Rules and Chance 372
Example 2: Rules for Similar Book Purchases 374
14.2 Collaborative Filtering 376
Data Type and Format 376
Example 3: Netflix Prize Contest 377
User-Based Collaborative Filtering: “People Like You” 378
Item-Based Collaborative Filtering 381
Advantages and Weaknesses of Collaborative Filtering 381
Collaborative Filtering vs. Association Rules 384
14.3 Summary 385
Problems 387
Chapter 15 Cluster Analysis 391
Python 391
15.1 Introduction 392
Example: Public Utilities 393
15.2 Measuring Distance Between Two Records 395
Euclidean Distance 396
Normalizing Numerical Measurements 397
Other Distance Measures for Numerical Data 398
Distance Measures for Categorical Data 400
Distance Measures for Mixed Data 400
15.3 Measuring Distance Between Two Clusters 401
Minimum Distance 401
Maximum Distance 401
Average Distance 401
Centroid Distance 401
15.4 Hierarchical (Agglomerative) Clustering 403
Single Linkage 404
Complete Linkage 404
Average Linkage 405
Centroid Linkage 405
Ward’s Method 405
Dendrograms: Displaying Clustering Process and Results 406
Validating Clusters 408
Limitations of Hierarchical Clustering 409
15.5 Non-Hierarchical Clustering: The k-Means Algorithm 411
Choosing the Number of Clusters (k) 412
Problems 418
Part VI Forecasting Time Series
Chapter 16 Handling Time Series 423
Python 423
16.1 Introduction 424
16.2 Descriptive vs. Predictive Modeling 425
16.3 Popular Forecasting Methods in Business 425
Combining Methods 426
16.4 Time Series Components 426
Example: Ridership on Amtrak Trains 427
16.5 Data-Partitioning and Performance Evaluation 431
Benchmark Performance: Naive Forecasts 432
Generating Future Forecasts 434
Problems 436
Chapter 17 Regression-Based Forecasting 439
Python 439
17.1 A Model with Trend 440
Linear Trend 440
Exponential Trend 444
Polynomial Trend 444
17.2 A Model with Seasonality 447
17.3 A Model with Trend and Seasonality 449
17.4 Autocorrelation and ARIMA Models 451
Computing Autocorrelation 451
Improving Forecasts by Integrating Autocorrelation Information 454
Evaluating Predictability 456
Problems 459
Chapter 18 Smoothing Methods 469
Python 469
18.1 Introduction 470
18.2 Moving Average 470
Centered Moving Average for Visualization 470
Trailing Moving Average for Forecasting 471
Choosing Window Width (w) 475
18.3 Simple Exponential Smoothing 475
Choosing Smoothing Parameter α 476
Relation Between Moving Average and Simple Exponential Smoothing 477
18.4 Advanced Exponential Smoothing 479
Series with a Trend 479
Series with a Trend and Seasonality 480
Series with Seasonality (No Trend) 480
Problems 483
Part VII Data Analytics
Chapter 19 Social Network Analytics 493
Python 493
19.1 Introduction 494
19.2 Directed vs. Undirected Networks 495
19.3 Visualizing and Analyzing Networks 495
Plot Layout 498
Edge List 499
Adjacency Matrix 500
Using Network Data in Classification and Prediction 500
19.4 Social Data Metrics and Taxonomy 500
Node-Level Centrality Metrics 502
Egocentric Network 503
Network Metrics 503
19.5 Using Network Metrics in Prediction and Classification 507
Link Prediction 507
Entity Resolution 507
Collaborative Filtering 510
19.6 Collecting Social Network Data with Python 513
19.7 Advantages and Disadvantages 514
Problems 516
Chapter 20 Text Mining 517
Python 517
20.1 Introduction 518
20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words” 519
20.3 Bag-of-Words vs. Meaning Extraction at Document Level 519
20.4 Preprocessing the Text 521
Tokenization 521
Text Reduction 523
Presence/Absence vs. Frequency 526
Term Frequency–Inverse Document Frequency (TF-IDF) 526
From Terms to Concepts: Latent Semantic Indexing 528
Extracting Meaning 528
20.5 Implementing Data Mining Methods 529
20.6 Example: Online Discussions on Autos and Electronics 529
Importing and Labeling the Records 530
Text Preprocessing in Python 530
Producing a Concept Matrix 530
Fitting a Predictive Model 532
Prediction 532
20.7 Summary 533
Problems 534
Part VIII Cases
Chapter 21 Cases 539
21.1 Charles Book Club 539
The Book Industry 539
Database Marketing at Charles 540
Data Mining Techniques 542
Assignment 544
21.2 German Credit 545
Background 545
Data 546
Assignment 546
21.3 Tayko Software Cataloger 551
Background 551
The Mailing Experiment 551
Data 551
Assignment 553
21.4 Political Persuasion 554
Background 554
Predictive Analytics Arrives in US Politics 554
Political Targeting 555
Uplift 555
Data 556
Assignment 557
21.5 Taxi Cancellations 558
Business Situation 558
Assignment 558
21.6 Segmenting Consumers of Bath Soap 559
Business Situation 559
Key Problems 560
Data 560
Measuring Brand Loyalty 560
Assignment 562
21.7 Direct-Mail Fundraising 562
Background 562
Data 563
Assignment 564
21.8 Catalog Cross-Selling 565
Background 565
Assignment 565
21.9 Time Series Case: Forecasting Public Transportation Demand 566
Background 566
Problem Description 566
Available Data 566
Assignment Goal 567
Assignment 567
Tips and Suggested Steps 567
References 569
Data Files Used in the Book 571
Python Utility Functions 575
Index 585