Final Project Diamond Database

Problem Description:

I have always been interested in diamonds: their value, how they are traded, and the global network behind them. Growing up in India, where diamond buying is culturally significant and cities like Surat serve as major global centers for diamond cutting and polishing, I was frequently exposed to discussions about diamond quality, pricing, and craftsmanship. I also enjoy watching documentaries about diamond grading and international trading, which deepened my curiosity about how specific characteristics influence market value.

Although my personal experience is rooted in the Indian diamond market, the dataset used in this project, the well-known diamonds dataset, reflects U.S. retail pricing. This creates an opportunity to compare my personal understanding with patterns observed in a different but well-established market. Importantly, the attributes used to evaluate diamonds (carat weight, cut quality, clarity, and color) are grading criteria used across global markets, including both the U.S. and India. These variables form the “Four Cs,” a framework recognized internationally by jewelers, traders, and gemological institutes, which makes the dataset meaningful beyond its geographic context.

The research question for this project is:

Which universal diamond characteristics—such as carat weight, cut quality, clarity, and color—have the strongest influence on price, and how can data visualizations help reveal these relationships?

To explore this, I analyze the dataset using a scatter plot with a fitted regression line, a boxplot, a faceted multivariate scatter plot, and a heatmap of median prices. The goal is to uncover how these key variables jointly shape diamond pricing. This topic is personally meaningful because it connects my background and experience in the Indian market with a structured, data-driven exploration of price determinants in the U.S. market.

Related Work:

The following visuals on the diamonds dataset inspired this project:







Solution:
To analyze how universal diamond characteristics influence price, I used the diamonds dataset in R and applied a sequence of visual and statistical methods that build progressively from simple to more complex multivariate analysis. Each visualization was chosen to answer a specific part of the research question while deepening my understanding of the relationships among diamond attributes.
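Before plotting, it helps to confirm what the dataset contains. The following is a minimal sketch, assuming only that the ggplot2 package is installed, since it ships the diamonds data. The three grading columns are stored as ordered factors, which determines the order in which the later plots lay out cut, color, and clarity.

```r
library(ggplot2)  # provides the diamonds dataset

dim(diamonds)             # number of rows and columns
levels(diamonds$cut)      # cut grades, from Fair up to Ideal
levels(diamonds$color)    # color grades, from D (best) to J (worst)
levels(diamonds$clarity)  # clarity grades, from I1 (worst) to IF (best)

# Quick numeric overview of the two continuous variables used throughout
summary(diamonds[, c("carat", "price")])
```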




I began with a scatter plot of carat weight vs. price, adding a linear regression line to summarize the trend. This is the most fundamental relationship in diamond pricing, and the plot clearly shows that price increases sharply with carat weight, in line with expectations from the diamond market. The straight line is only a rough summary, since price rises faster than linearly at higher carat weights. This first visualization serves as a baseline for the rest of the analysis: it suggests that carat is the strongest single predictor of price, while also revealing substantial variability at any given weight that must be explained by other factors.
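The trend in the scatter plot can also be put into numbers. This is a rough sketch rather than part of the original analysis: it fits the same straight line that the plot draws, and, as an added assumption of mine, a log-log fit for comparison. The object names (fit_linear, fit_loglog, and so on) are my own.

```r
library(ggplot2)  # provides the diamonds dataset

# Straight-line fit, matching the red line in the scatter plot
fit_linear <- lm(price ~ carat, data = diamonds)

# Log-log fit: an alternative that allows price to grow faster than linearly
fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)

r_carat_price <- cor(diamonds$carat, diamonds$price)  # Pearson correlation
r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared

coef(fit_linear)  # intercept and price change per additional carat
c(correlation = r_carat_price, r2_linear = r2_linear, r2_loglog = r2_loglog)
```

The two R-squared values are computed on different scales (price vs. log price), so they indicate how well each form describes the data rather than serving as a strict like-for-like comparison.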

Next, I introduced cut, one of the 4 Cs, using a boxplot of price grouped by cut quality. While the scatter plot highlighted continuous variation, this visualization captures how a categorical grading standard relates to the price distribution. The boxplot shows differences between cut categories, but not a simple ordering in which better cuts cost more: in this dataset, Ideal diamonds, the top cut grade, have the lowest median price, largely because they tend to be smaller stones. The heavy overlap between groups likewise shows that cut alone does not explain price variation, motivating the need for multivariate analysis.
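Because the boxplot does not account for stone size, it can help to tabulate the median price next to the median carat weight for each cut. A minimal sketch; the name cut_summary is my own and not part of the original code.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# One row per cut grade: group size, median price, and median carat weight
cut_summary <- diamonds |>
  group_by(cut) |>
  summarise(
    n = n(),
    median_price = median(price),
    median_carat = median(carat),
    .groups = "drop"
  )

cut_summary
```

Reading the two median columns side by side shows whether a price difference between cut grades goes along with a difference in typical stone size.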

To deepen the analysis beyond individual variables, I created a multivariate scatter plot that examines how several diamond characteristics interact to influence price. This visualization plots carat weight against price, divides the graph into separate facet panels by cut quality, and encodes clarity as point color. Because the color mapping is inherited by the smoothing layer, each cut panel contains one linear fit per clarity grade, which highlights the trend between carat and price within each group. Presenting the data in this structured way allows for a more nuanced examination of how price responds to carat weight in combination with additional quality measures. In the plot, the fitted lines tend to be steeper in the panels for better cuts, suggesting that an increase in carat weight goes along with a larger price increase for diamonds with superior cuts.
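One way to check how the carat-price slope differs between cut grades is to fit a separate straight line for each cut and compare the slopes. This is a sketch under my own assumptions: it uses the full dataset rather than the 5,000-row sample and fits a single line per cut, so it is a simplification of what the faceted plot draws. The name slope_by_cut is hypothetical.

```r
library(ggplot2)  # provides the diamonds dataset

# Fit price ~ carat separately within each cut grade and keep the carat slope
slope_by_cut <- sapply(
  split(diamonds, diamonds$cut),
  function(d) coef(lm(price ~ carat, data = d))[["carat"]]
)

round(slope_by_cut)  # estimated price change per additional carat, by cut
```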

Clarity adds a further dimension of insight, as the color scale shows that, at a given carat weight, higher-clarity stones tend to appear in higher price regions within each cut category. This pattern suggests that clarity continues to differentiate diamonds even after accounting for cut and carat, reinforcing the idea that the 4 Cs operate collectively rather than independently. Lower-clarity stones tend to cluster at lower price points, while higher-clarity stones follow steeper upward trajectories along their regression lines. By bringing multiple attributes together in one visualization, the faceted scatter plot highlights how these characteristics interact and provides a more complete understanding of the factors that shape diamond pricing in real-world market data.
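The visual impression that clarity still matters after accounting for cut and carat can be cross-checked with a single regression that includes all four characteristics. The model below is my own illustration, not part of the original analysis. It assumes a log-log form for price and carat, and converts the grades to unordered factors so that each coefficient is a plain contrast against the first level (I1 for clarity).

```r
library(ggplot2)  # provides the diamonds dataset

# Unordered factors give treatment contrasts instead of polynomial contrasts
d <- transform(
  diamonds,
  cut = factor(cut, ordered = FALSE),
  color = factor(color, ordered = FALSE),
  clarity = factor(clarity, ordered = FALSE)
)

fit_joint <- lm(log(price) ~ log(carat) + cut + color + clarity, data = d)

# Clarity coefficients: log-price difference from I1 at equal carat, cut, and color
is_clarity <- grepl("^clarity", names(coef(fit_joint)))
clarity_effects <- coef(fit_joint)[is_clarity]

round(clarity_effects, 2)
summary(fit_joint)$r.squared
```

Positive clarity coefficients mean that, among stones of the same weight, cut, and color, higher clarity grades are associated with higher prices.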


Lastly, I created a heatmap that displays the median price for each combination of color and clarity. This approach summarizes two categorical grading scales in a single view, making patterns across them easy to read. The 4 Cs framework predicts that the highest clarity grades, such as IF and VVS1, and the best color grades, particularly D through F, command a premium, with value declining toward colors such as I and J even when clarity is high. One caveat applies when reading the heatmap against that expectation: the medians are not adjusted for carat weight. Stones with top clarity and color grades in this dataset tend to be smaller, so a raw cell median can understate, or even reverse, the premium those grades earn at a fixed carat weight. Read alongside the carat-based plots above, the heatmap reinforces the concept that diamond pricing is determined by the combined effects of multiple quality attributes rather than any single characteristic on its own.
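Since the heatmap medians are not adjusted for stone size, one simple check is to redraw it for diamonds of similar weight. The sketch below restricts the data to stones between 0.9 and 1.1 carats. That band is an arbitrary choice of mine, and the names band, heat_band, and p_band are hypothetical. Some color and clarity combinations may contain few stones in this range, so the count n is kept for reference.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Keep only stones of roughly one carat so that size is approximately held fixed
band <- diamonds |>
  filter(carat >= 0.9, carat <= 1.1)

heat_band <- band |>
  group_by(color, clarity) |>
  summarise(median_price = median(price), n = n(), .groups = "drop")

p_band <- ggplot(heat_band, aes(x = color, y = clarity, fill = median_price)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c() +
  labs(
    title = "Median Price by Color and Clarity (0.9 to 1.1 carat)",
    x = "Color",
    y = "Clarity",
    fill = "Median Price"
  ) +
  theme_minimal()

p_band
```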

Discussion and Conclusions:

This project analyzed the diamonds dataset to understand how universal grading characteristics (carat, cut, color, and clarity) influence diamond prices in the U.S. retail market. The progression of visualizations showed that carat weight is the strongest individual predictor of price, but the additional analyses made it clear that quality attributes also shape the value of a diamond. The boxplot revealed differences in price distribution across cut categories, and the multivariate faceted scatter plot demonstrated how cut and clarity modify the carat–price relationship in ways that cannot be understood through bivariate analysis alone. The heatmap added a combined view of color and clarity, showing how median price varies across both grading scales. Altogether, the results reaffirmed the importance of evaluating diamonds using the 4 Cs as an interconnected system rather than as isolated variables.

Although the dataset originates from the U.S. market, the universal nature of the 4 Cs allows these findings to be relevant to diamond buyers and traders globally, including markets like India where diamond purchasing is deeply rooted in cultural and economic traditions. Future work could expand this project by integrating datasets from international marketplaces, auction houses, or wholesale trading centers such as Surat, Antwerp, Dubai, or Hong Kong. Comparing regional pricing trends, cultural preferences, and market-specific valuation patterns could provide a more comprehensive global view of how diamonds are priced. With richer and more diverse data, the analysis could evolve to include time-series forecasting, machine learning pricing models, or even geospatial mapping of global diamond supply chains, offering more sophisticated insights into this global luxury commodity.






R Code for reference:

# Load packages and data
library(ggplot2)  # plotting; also provides the diamonds dataset
library(dplyr)    # sample_n(), group_by(), summarise()

data("diamonds")


# Scatter plot with linear regression: carat vs. price
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Relationship Between Carat Weight and Price",
    x = "Carat",
    y = "Price (USD)"
  )


# Boxplot: price by cut quality
ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  labs(
    title = "Diamond Price by Cut Quality",
    x = "Cut",
    y = "Price (USD)"
  )


# Faceted scatter plot with regression lines, colored by clarity
set.seed(123)  # make the random sample reproducible
diamonds_sample <- diamonds |> sample_n(5000)  # subsample to keep the plot readable

# The color aesthetic is set globally, so geom_smooth() fits one line
# per clarity grade within each cut panel
ggplot(diamonds_sample,
       aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ cut) +
  labs(
    title = "Carat vs Price by Cut, Colored by Clarity",
    x = "Carat",
    y = "Price (USD)",
    color = "Clarity"
  )


# Heatmap: median price for each color and clarity combination
# (dplyr is already loaded above)
heat_data <- diamonds |>
  group_by(color, clarity) |>
  summarise(median_price = median(price), .groups = "drop")

ggplot(heat_data, aes(x = color, y = clarity, fill = median_price)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c() +
  labs(
    title = "Multivariate Heatmap: Median Price by Color and Clarity",
    x = "Color",
    y = "Clarity",
    fill = "Median Price"
  ) +
  theme_minimal()












