BOOKPRICE.co.kr
Book price comparison site
Statistics in Corpus Linguistics Research : A New Approach (Paperback)

Sean Wallis (Author)
Routledge
90,650 KRW (list price)

New books (일반도서)

Bookstore | Sale price | Discount | Shipping | Benefits/extras | Effective lowest price
(not shown) | 74,330 KRW | -18% | 0 KRW | 3,720 KRW | 70,610 KRW

Note: search results may include other books.

Used books (중고도서)

Bookstore | Type | Listings | Lowest price
(no listings retrieved)

eBook

Bookstore | List price | Sale price | Mileage | Effective lowest price
(no listings retrieved)


Book information

· Title: Statistics in Corpus Linguistics Research : A New Approach (Paperback)
· Category: Foreign books > Linguistics > Linguistics > General
· ISBN: 9781138589384
· Pages: 356
· Publication date: 2020-11-23

Table of contents

Preface: 1. Why do we need another book on statistics? 2. Statistics and scientific rigour 3. Why is statistics difficult? 4. Looking down the observer’s end of the telescope 5. What do linguists need to know about statistics? 6. Acknowledgments
A note on terminology and notation
Contingency tests for different purposes

Part 1. Motivations
1. What might corpora tell us about language? 1. Introduction 2. What might a corpus tell us? 3. The 3A cycle 3.1 Annotation, abstraction and analysis 3.2 The problem of representational plurality 3.3 ICECUP: a platform for treebank research 4. What might a richly annotated corpus tell us? 5. External influences: modal shall / will over time 6. Interacting grammatical decisions: NP premodification 7. Framing constraints and interaction evidence 7.1 Framing frequency evidence 7.2 Framing interaction evidence 7.3 Framing and annotation 7.4 Framing and sampling 8. Conclusions

Part 2. Designing Experiments with Corpora
2. The idea of corpus experiments 1. Introduction 2. Experimentation and observation 2.1 Obtaining data 2.2 Research questions and hypotheses 2.3 From hypothesis to experiment 3. Evaluating a hypothesis 3.1 The chi-square test 3.2 Extracting data 3.3 Visualising proportions, probabilities and significance 4. Refining the experiment 5. Correlations and causes 6. A linguistic interaction experiment 7. Experiments and disproof 8. What is the purpose of an experiment? 9. Conclusions
3. That vexed problem of choice 1. Introduction 1.1 The traditional ‘per million words’ approach 1.2 How did per million word statistics become dominant? 1.3 Choice models and linguistic theory 1.4 The vexed problem of choice 1.5 Exposure rates and other experimental models 1.6 What do we mean by ‘choice’? 2. Parameters of choice 2.1 Types of mutual substitution 2.2 Multi-way choices and decision trees 2.3 Binomial statistics, tests and time series 2.4 Lavandera’s dangerous hypothesis 3. A methodological progression? 3.1 Per million words 3.2 Selecting a more plausible baseline 3.3 Enumerating alternates 3.4 Linguistically restricting the sample 3.5 Eliminating non-alternating cases 3.6 A methodological progression 4. Objections to variationism 4.1 Feasibility 4.2 Arbitrariness 4.3 Oversimplification 4.4 The problem of polysemy 4.5 A complex ecology? 4.6 Necessary reductionism versus complex statistical models 4.7 Discussion 5. Conclusions
4. Choice versus meaning 1. Introduction 2. The meaning of very 3. The choice of very 4. Refining baselines by type 5. Conclusions
5. Balanced samples and imagined populations 1. Introduction 2. A study in genre variation 3. Imagining populations 4. Multi-variate and multi-level modeling 5. More texts, or longer ones? 6. Conclusions

Part 3. Confidence intervals and significance tests
6. Introducing inferential statistics 1. Why is statistics difficult? 2. The idea of inferential statistics 3. The randomness of life 3.1 The Binomial distribution 3.2 The ideal Binomial distribution 3.3 Skewed distributions 3.4 From Binomial to Normal 3.5 From Gauss to Wilson 3.6 Scatter and confidence 4. Conclusions
7. Plotting with confidence 1. Introduction 1.1 Visualising data 1.2 Comparing observations and identifying significant differences 2. Plotting the graph 2.1 Step 1. Gather raw data 2.2 Step 2. Calculate basic Wilson score interval terms 2.3 Step 3. Calculate the Wilson interval 2.4 Step 4. Plotting intervals on graphs 3. Comparing and plotting change 3.1 The Newcombe-Wilson interval 3.2 Comparing intervals: an illustration 3.3 What does the Newcombe-Wilson interval represent? 3.4 Comparing multiple points 3.5 Plotting percentage difference 3.6 Floating bar charts 4. An apparent paradox 5. Conclusions
8. From intervals to tests 1. Introduction 1.1 Binomial intervals and tests 1.2 Sampling assumptions 1.3 Deriving a Binomial distribution 1.4 Some example data 2. Tests for a single Binomial proportion 2.1 The single-sample z test 2.2 The 2 × 1 goodness of fit χ² test 2.3 The Wilson score interval 2.4 Correcting for continuity 2.5 The ‘exact’ Binomial test 2.6 The Clopper-Pearson interval 2.7 The log-likelihood test 2.8 A simple performance comparison 3. Tests for comparing two observed proportions 3.1 The 2 × 2 χ² and z test for two independent proportions 3.2 The z test for two independent proportions from independent populations 3.3 The z test for two independent proportions with a given difference in population means 3.4 Continuity-corrected 2 × 2 tests 3.5 The Fisher ‘exact’ test 4. Applying contingency tests 4.1 Selecting tests 4.2 Analysing larger tables 4.3 Linguistic choice 4.4 Case interaction 4.5 Large samples and small populations 5. Comparing the results of experiments 6. Conclusions
9. Comparing frequencies in the same distribution 1. Introduction 2. The single sample z test 2.1 Comparing frequency pairs for significant difference 2.2 Performing the test 3. Testing and interpreting intervals 3.1 The Wilson comparison heuristic 3.2 Visualising the test 4. Conclusions
10. Reciprocating the Wilson interval 1. Introduction 2. The Wilson interval of mean utterance length 2.1 Scatter and confidence 2.2 From length to proportion 2.3 An example: confidence intervals on mean length of utterance 2.4 Plotting the results 3. Intervals on monotonic functions of p 4. Conclusions
11. Competition between choices over time 1. Introduction 2. The ‘S curve’ 3. Boundaries and confidence intervals 3.1 Confidence intervals for p 3.2 Logistic curves and Wilson intervals 4. Logistic regression 4.1 From linear to logistic regression 4.2 Logit-Wilson regression 4.3 Example 1: The decline of the to-infinitive perfect 4.4 Example 2: Catenative verbs in competition 4.5 Review 5. Impossible logistic Multinomials 5.1 Binomials 5.2 Impossible Multinomials 5.3 Possible hierarchical Multinomials 5.4 A hierarchical reanalysis of Example 2 5.5 The three-body problem 6. Conclusions
12. The replication crisis and the New Statistics 1. Introduction 2. A corpus linguistics debate 3. Psychology lessons? 4. The road not travelled 5. What does this mean for corpus linguistics? 6. Some recommendations 6.1 Recommendation 1: include a replication step 6.2 Recommendation 2: focus on large effects and clear visualisations 6.3 Recommendation 3: play devil’s advocate 6.4 A checklist for empirical linguistics 7. Conclusions
13. Choosing the right test 1. Introduction 1.1 Choosing a dependent variable and baselines 1.2 Choosing independent variables 2. Tests for categorical data 2.1 Two types of contingency test 2.2 The benefits of simple tests 2.3 Visualising uncertainty 2.4 When to use goodness of fit tests 2.5 Tests for comparing results 2.6 Optimum methods of calculation 3. Tests for other types of data 3.1 t tests for comparing two independent samples of numeric data 3.2 Reversing tests 3.3 Tests for other types of variables 3.4 Quantisation 4. Conclusions

Part 4. Effect sizes and meta-tests
14. The size of an effect 1. Introduction 2. Effect sizes for two-variable tables 2.1 Simple difference 2.2 The problem of prediction 2.3 Cramér’s φ 2.4 Other probabilistic approaches to dependent probability 3. Confidence intervals on φ 3.1 Confidence intervals on 2 × 2 φ 3.2 Confidence intervals for Cramér’s φ 3.3 An example: Investigating grammatical priming 4. Goodness of fit effect sizes 4.1 Unweighted φp 4.2 Variance-weighted φe 4.3 Example: Correlating the present perfect 5. Conclusions
15. Meta-tests for comparing tables of results 1. Introduction 1.1 How not to compare test results 1.2 Comparing sizes of effect 1.3 Other meta-tests 2. Some preliminaries 2.1 Test assumptions 2.2 Statistical principles and correcting for continuity 2.3 Example data and notation 3. Point and multi-point tests for homogeneity tables 3.1 Reorganising contingency tables for 2 × 1 tests 3.2 The Newcombe-Wilson point test 3.3 The Gaussian point test 3.4 The multi-point test for r × c homogeneity tables 4. Gradient tests for homogeneity tables 4.1 The 2 × 2 Newcombe-Wilson gradient test 4.2 Cramér’s φ interval and test 4.3 r × 2 homogeneity gradient tests 4.4 Interpreting gradient meta-tests for large tables 5. Gradient tests for goodness of fit tables 5.1 The 2 × 1 Wilson interval gradient test 5.2 r × 1 goodness of fit gradient tests 6. Subset tests 6.1 Point tests for subsets 6.2 Multi-point subset tests 6.3 Gradient subset tests 6.4 Goodness of fit subset tests 7. Conclusions

Part 5. Statistical solutions for corpus samples
16. Conducting research with imperfect data 1. Introduction 2. Reviewing subsamples 2.1 Example 1: get vs. be passive 2.2 Subsampling and reviewing 2.3 Estimating the observed probability p 2.4 Performing a contingency test and extending to Multinomial dependent variables 3. Reviewing preliminary analyses 3.1 Example 2: embedded and sequential postmodifiers 3.2 Testing the worst-case scenario 3.3 Combining subsampling with worst-case analysis 3.4 Ambiguity and error 4. Resampling and p-hacking 5. Conclusions
17. Adjusting intervals for random-text samples 1. Introduction 2. Recalibrating Binomial models 3. Examples with large samples 3.1 Example 1: interrogative clause probability, ‘direct conversations’ 3.2 Example 2: clauses per word, ‘direct conversations’ 3.3 Uneven-size subsamples 3.4 Example 1 revisited, across ICE-GB 4. Alternation studies with small samples 4.1 Applying the method 4.2 Singletons, partitioning and pooling 4.3 Discussion 5. Conclusions

Part 6. Concluding remarks
18. Plotting the Wilson distribution 1. Introduction 2. Plotting the distribution 2.1 Calculating w(α) from the standard Normal distribution 2.2 Plotting points 2.3 Employing a delta approximation 3. Example plots 3.1 Sample size n = 10, observed proportion p = 0.5 3.2 Properties of Wilson areas 3.3 The effect of p tending to extremes 3.4 The effect of very small n 4. Further perspectives on Wilson distributions 4.1 Percentiles of the Wilson distributions 4.2 The logit Wilson distribution 5. Alternative distributions 5.1 Continuity-corrected Wilson distributions 5.2 Clopper-Pearson distributions 6. Conclusions
19. In conclusion

Appendices
1. The interval equality principle 1. Introduction 1.1 Axiom 1.2 Functional notation 2. Applications 2.1 Wilson score interval 2.2 Wilson score interval with continuity-correction 2.3 Binomial 2.4 Log-likelihood and other significance test functions 3. Searching for interval bounds with a computer
2. Pseudo-code for computational procedures 1. Simple logistic regression algorithm with logit Wilson variance 1.1 Calculate sum of squared errors e for known m and k 1.2 Find optimum value of k by search for smallest error e for a given gradient m 1.3 Find optimum values of m and k by method of least squares, also return error e 1.4 Perform regression 2. Binomial and Fisher functions 2.1 Core functions 2.2 The Clopper-Pearson interval

Glossary
References
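The two methods at the heart of Part 3, the Wilson score interval and the Newcombe-Wilson difference interval, can be sketched in a few lines of Python. This is an illustrative sketch based on the standard published formulas (Wilson 1927; Newcombe 1998), not code taken from the book:

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p = f/n.
    z is the two-tailed standard Normal critical value
    (approximately 1.96 for a 95% interval)."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_wilson(p1, n1, p2, n2, z=1.959964):
    """Newcombe-Wilson interval for the difference d = p2 - p1 between
    two independently sampled proportions. If the interval excludes
    zero, the difference is significant at the chosen error level."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper

# 45 hits out of 100 cases: 95% Wilson interval
lo, hi = wilson_interval(0.45, 100)   # roughly (0.356, 0.548)

# Compare 45/100 against 60/100: does the difference exclude zero?
d_lo, d_hi = newcombe_wilson(0.45, 100, 0.60, 100)
```

Unlike the textbook Wald interval, the Wilson interval stays inside [0, 1] and remains usable for skewed proportions and small n, which is why the book builds its plotting and testing methods on it.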

Book database provided by Aladin (www.aladin.co.kr)