Projects
Three projects, ordered by what they ask of the data: first understand what's happening, then predict what will, then recommend what to do about it.
Do Listings Lie? How Airbnb Language Predicts Price
Text analytics across 61,521 listings in 16 cities
Premium vocabulary (“luxury,” “stunning,” “breathtaking”) correlates with nightly price three times more strongly than generic positive sentiment.
Airbnb hosts compete on price, but listing quality varies wildly. The question: does the language of a description explain price beyond property facts like bedrooms, room type, and location — or is text just noise?
After cleaning 261,894 raw listings down to 61,521 across 16 English-speaking cities, we ran n-gram analysis, TF-IDF regression, topic modelling, and sentiment scoring against log(nightly price).
Generic sentiment barely moved the needle (r = 0.08). Premium-specific vocabulary was three times stronger (r = 0.24), and that signal survived controlling for property characteristics (residual r = 0.17): luxury language is associated with a price premium that the property's characteristics alone do not explain.
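The "residual r" idea can be sketched in a few lines. This is a minimal illustration on synthetic data, not the project's R pipeline: the variable names, the single property control (bedrooms), and the simulated coefficients are all assumptions made for the example. It regresses log price on the property variable, then correlates the text score with what is left over.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals from a least-squares fit of y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Synthetic listings (hypothetical numbers, for illustration only):
# bigger properties are pricier AND tend to use more premium vocabulary.
rng = random.Random(0)
bedrooms = [rng.randint(1, 4) for _ in range(2000)]
premium_score = [0.3 * b + rng.gauss(0, 1) for b in bedrooms]
log_price = [
    4.0 + 0.4 * b + 0.1 * p + rng.gauss(0, 0.5)
    for b, p in zip(bedrooms, premium_score)
]

# Raw correlation mixes the text signal with property size.
raw_r = pearson(premium_score, log_price)

# Residual correlation: strip out what bedrooms explain, then correlate.
residual_r = pearson(premium_score, residuals(log_price, bedrooms))
```

In this toy setup the residual correlation comes out smaller than the raw one but stays positive, which is the same pattern as the drop from 0.24 to 0.17 reported above.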
Why it matters
What hosts say matters far more than how positively they say it — specific, concrete language is a stronger pricing signal than tone. That's a directly actionable takeaway for hosts writing listings and for platforms screening low-effort ones.
R² on a 20% held-out test set — how much price variance each model explains.
Methodology & data
Dataset: Public Airbnb listings dataset, 16 English-speaking cities
Tools: R · glmnet (LASSO) · tidytext · topicmodels (LDA) · sentimentr
- Cleaned 261,894 listings to 61,521 across 16 cities; log-transformed price for near-normality
- Metadata-only LASSO benchmark (8 property variables): R² = 0.569
- Text-only TF-IDF LASSO (1,415 selected words): R² = 0.446
- Combined model (metadata + TF-IDF + LDA topics): R² = 0.660 — a 9-point lift over metadata alone
- Trained on London (11,394 listings), tested cross-city generalisation on 15 held-out cities
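For readers unfamiliar with TF-IDF, the weighting behind the text-only model can be sketched as follows. This is a minimal pure-Python illustration assuming plain term frequency times log inverse document frequency and whitespace tokenisation; the project's exact tokenisation and weighting are not stated, and the three example listings are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / tokens in doc) * ln(N / docs containing term)
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Invented example listings, for illustration only.
listings = [
    "stunning luxury flat with river view",
    "cosy flat near station",
    "luxury penthouse with stunning view",
]
weights = tf_idf(listings)
```

Terms that appear in every document get a weight of zero, so the regression only sees vocabulary that distinguishes one listing from another. A sparse penalty such as LASSO then keeps a subset of those terms.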
Psychological Distress & Physical Multimorbidity in England
Negative binomial regression on Health Survey for England 2022 (n = 4,735)
Each one-point rise in psychological distress predicts a 9.7% higher expected count of physical health conditions, and that effect is statistically indistinguishable between the richest and poorest neighbourhoods.
NHS policy increasingly treats multimorbidity as a deprivation problem. This project asked a sharper question: does area deprivation change how strongly psychological distress predicts physical illness, or do the two just operate side by side?
Using GHQ-12 distress scores and grouped condition counts from the Health Survey for England 2022, we fit a negative binomial regression (justified over Poisson by a likelihood-ratio test, p < 0.001) across four nested model specifications with full demographic, socioeconomic, and lifestyle controls.
The distress → multimorbidity association held at IRR = 1.097 (95% CI [1.085, 1.108], p < 0.001) whether or not deprivation controls were included. When we tested distress × deprivation interaction terms directly, none were significant (joint Wald p = 0.877) — deprivation raises the baseline burden, but doesn't amplify the distress effect.
Why it matters
The natural assumption is that distress and deprivation compound each other. They don't interact — they add. That argues for universal mental-health screening in physical care pathways, not just screening targeted at deprived areas.
Predicted physical condition count by distress score — the lines run parallel, not divergent, showing deprivation shifts the baseline but not the slope.
Methodology & data
Dataset: Health Survey for England 2022 (UK Data Service, Study Number 9469)
Tools: Python · statsmodels · Negative binomial GLM · Survey weighting
- Analytical sample: 4,735 adults (complete-case), Health Survey for England 2022, UKDS SN 9469
- Outcome overdispersed (variance/mean = 1.53) → negative binomial preferred over Poisson (LR = 122.68, p < 0.001)
- 4 nested models: GHQ-12 only → + demographics → + full controls → + GHQ×IMD interaction
- Preferred spec (Model 3): IRR = 1.097 for GHQ-12, robust to OLS, logistic, and BMI-adjusted checks
- GHQ×IMD interaction jointly insignificant (χ²(4) = 1.21, p = 0.877) — no effect modification by deprivation
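The headline figures above can be sanity-checked with standard-library arithmetic. The sketch below converts the IRR to a percentage change and recomputes the two p-values from the reported test statistics. It assumes the likelihood-ratio test has 1 degree of freedom (one extra dispersion parameter), which the write-up does not state; the function names are illustrative.

```python
import math


def irr_to_percent_change(irr):
    """An incidence rate ratio of 1.097 means a 9.7% higher expected count."""
    return (irr - 1.0) * 100.0


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Closed form: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    return math.exp(-half) * sum(
        half**k / math.factorial(k) for k in range(df // 2)
    )


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


pct_change = irr_to_percent_change(1.097)   # GHQ-12 effect per point
p_interaction = chi2_sf_even_df(1.21, 4)    # joint Wald test, chi2(4) = 1.21
p_overdispersion = chi2_sf_df1(122.68)      # LR test, df = 1 assumed
```

The recomputed interaction p-value agrees with the reported 0.877 to within the rounding of the test statistic, and the likelihood-ratio p-value is far below 0.001.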
Heat Smart Orkney: The Commercial Case for Demand Response
Curtailment modelling & commercial viability analysis for Kaluza
A residential demand-response scheme would absorb less than 1% of Orkney's curtailed wind energy — and still turn an annual profit of £151,000 at scale.
Orkney curtails roughly 23% of its wind generation — about 450,000 MWh a year — because the 40 MW subsea export cable saturates well before local wind capacity does. Kaluza asked whether a residential demand-response scheme, using smart storage heaters, could turn that waste into revenue.
We modelled turbine telemetry to build a validated power curve, quantified curtailment by season and time of day, and built a household-level financial model spanning device cost, flexibility payments from the network operator, and wind-farm service fees.
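The logic of the curtailment estimate can be sketched as the gap between what the power curve says a turbine should produce and what it actually produced. This is a simplified illustration, not the project's model: the idealised cubic curve, every parameter value (cut-in, rated and cut-out speeds, rated power, tolerance), and the sample readings are hypothetical.

```python
def expected_power_kw(wind_speed, cut_in=3.0, rated_speed=12.0,
                      cut_out=25.0, rated_kw=900.0):
    """Idealised power curve (all parameters are hypothetical).

    Zero below cut-in and at or above cut-out, a cubic ramp between
    cut-in and rated speed, and flat at rated power above that.
    """
    if wind_speed < cut_in or wind_speed >= cut_out:
        return 0.0
    if wind_speed >= rated_speed:
        return rated_kw
    ramp = (wind_speed**3 - cut_in**3) / (rated_speed**3 - cut_in**3)
    return rated_kw * ramp


def curtailed_energy_kwh(samples, interval_hours, tolerance_kw=5.0):
    """Sum the shortfall between expected and observed output.

    samples: iterable of (wind_speed_m_s, observed_power_kw) readings.
    Shortfalls within tolerance_kw are treated as noise, not curtailment.
    """
    total = 0.0
    for wind_speed, observed_kw in samples:
        shortfall = expected_power_kw(wind_speed) - observed_kw
        if shortfall > tolerance_kw:
            total += shortfall * interval_hours
    return total


# Four invented one-minute readings: calm, at rated output,
# held back by 400 kW in strong wind, and unconstrained in strong wind.
readings = [(2.0, 0.0), (12.0, 900.0), (15.0, 500.0), (15.0, 900.0)]
curtailed = curtailed_energy_kwh(readings, interval_hours=1 / 60)
```

Only the third reading counts as curtailed here. A fuller classification would also separate wind-limited and capacity-limited periods, as the project did.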
Even at the technical ceiling (32% penetration), the scheme absorbs only 2,023 MWh/year — 0.75% of the addressable curtailment pool. But because revenue comes from institutional flexibility payments rather than household subscriptions, the unit economics work well before the environmental impact is large: break-even lands at 15% penetration.
Why it matters
The scheme was never going to solve curtailment; the commercial opportunity is monetising flexibility, not eliminating waste. We recommended a phased rollout starting at the 15% household-penetration break-even point, contingent on locking in a long-term SSEN flexibility agreement.
Base case: SSEN flexibility payment £250/MWh, wind-farm fee £25/MWh. Break-even (highlighted) lands at 15% penetration.
Methodology & data
Dataset: Rousay turbine telemetry (2015–2018) & Orkney residential demand data
Tools: Python · pandas · Power curve modelling · Financial / sensitivity modelling
- Cleaned turbine telemetry (2.63 years, 1-min intervals) to classify wind-limited, capacity-limited, and demand-limited (curtailed) states
- Annual curtailment: ~450,431 MWh/yr (23.1%), of which 268,146 MWh/yr (59.5%) is DR-addressable
- Sense-checked against published Rousay turbine curtailment data (modelled total within a factor of 1.07 of the implied fleet total)
- Household financial model: device cost, 12-year lifetime, SSEN flexibility payment (£250/MWh base case), wind-farm service fee (£25/MWh)
- Break-even at 15% penetration, where the scheme first turns a profit (£36,590/yr); £151,112/yr profit and £283,459 cumulative cash flow by Year 5 at 30% penetration
- Sensitivity analysis: SSEN flexibility price is the dominant driver of viability, ahead of wind-farm fees or subsidy
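The share figures quoted in this project follow directly from the three energy totals above. The short check below uses only numbers stated in the write-up; the variable names are illustrative.

```python
# Annual energy figures taken from the write-up (MWh per year).
total_curtailed_mwh = 450_431        # all curtailed wind energy
addressable_mwh = 268_146            # portion demand response could reach
absorbed_at_ceiling_mwh = 2_023      # absorbed at 32% household penetration

# Share of curtailment that is addressable by demand response.
addressable_share = addressable_mwh / total_curtailed_mwh

# Share of the addressable pool the scheme absorbs at its technical ceiling.
absorbed_share_of_addressable = absorbed_at_ceiling_mwh / addressable_mwh

# Share of ALL curtailed energy absorbed at the ceiling.
absorbed_share_of_total = absorbed_at_ceiling_mwh / total_curtailed_mwh

print(f"Addressable share of curtailment: {addressable_share:.1%}")
print(f"Absorbed share of addressable pool: {absorbed_share_of_addressable:.2%}")
print(f"Absorbed share of all curtailment: {absorbed_share_of_total:.2%}")
```

Both absorbed shares sit below 1%, whichever denominator is used, which is why the case for the scheme rests on flexibility revenue and not on the volume of energy recovered.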