Data is Beautiful

346

The numbers 0–99 sorted alphabetically in different languages (files.catbox.moe)

submitted 5 months ago by [email protected] to c/[email protected]

50 comments

[–] [email protected] 39 points 5 months ago* (last edited 5 months ago) (3 children)

A bit confusing to read. The points are placed on the y-axis using ordinals rather than cardinals. This means that if you were to extend the plot (say, up to 200), the existing data points would move around. That’s not usually what we expect when plotting data.

Edit: actually, the problem is more severe than I initially thought. If the y-axis were plotted with cardinals (the way we usually plot data) then the German case would show 10 horizontal lines, immediately revealing a pattern in the data (caused by Germans speaking the ones digit before the tens digit).

[–] [email protected] 10 points 5 months ago

Initially, I thought that you were talking about ordinal vs. cardinal numbers (i.e., first vs. one), which was a bit confusing. But when trying to understand the placement of zwei in the German graph, I realized you meant that the points on the Y-axis are positioned relative to one another rather than against a fixed Y-axis scale.

I see that such plotting could be useful in some circumstances (it shows some interesting clustering in other languages), but I don't like it.

[–] [email protected] 7 points 5 months ago (3 children)

What's the problem? The y-axis is sorted from A at the bottom to Z at the top.

[–] [email protected] 12 points 5 months ago (2 children)

Let’s say you were plotting some temperature data. You take the temperature every day and record it for a month. When you go to plot the data, the normal thing to do is decide on the scale for the y-axis and then plot each temperature point according to where it fits on that scale. This allows you to see any trends in your data (perhaps it’s spring and the temperature is trending upwards over the month).

What you don’t do is sort your temperature data and then put the lowest temperature at the very bottom and the highest temperature at the top, with every other point spaced evenly between those extremes according to their rank. This completely obscures the relative temperature differences between the points!
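To make the contrast concrete, here is a minimal Python sketch. The temperature values are made up for illustration, and `scale_positions` and `rank_positions` are hypothetical helper names, not anything taken from the original plot.

```python
# Hypothetical daily temperatures in °C (made-up values for illustration).
temps = [10.0, 10.5, 11.0, 19.0, 20.0]

def scale_positions(values):
    """Place each value on a fixed linear scale from min (0.0) to max (1.0)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_positions(values):
    """Place each value by its rank only, evenly spaced from 0.0 to 1.0."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = [0.0] * len(values)
    for rank, i in enumerate(order):
        positions[i] = rank / (len(values) - 1)
    return positions

print(scale_positions(temps))  # the jump from 11.0 to 19.0 stays visible
print(rank_positions(temps))   # every step looks the same size
```

With the fixed scale, the three cool days cluster near the bottom and the two warm days near the top. With rank spacing, all five points are evenly spread and the jump disappears.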

Well, this is what was done with the number-word data we’re discussing. Look at the plot for English. Notice that zero is in the top left (because z is last in sequence), followed by one halfway up, which is also okay. But then look at two and three. You would expect two and three to be very close together because they both start with t, but they’re not. Words starting with t should be around 76% of the way up the y-axis (because t is the 20th letter of the alphabet), but two is 99% of the way up and three is 77% of the way up.

This is problematic if you’re hoping to use the plots to spot trends. For example, with German (as another commenter pointed out) all 2-digit number words read the ones place before the tens place. If the data were plotted by cardinality (treating each word as a rational number between 0 and 1) then you’d easily spot this trend in German number words because all the points would fall on roughly horizontal lines.
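A rough Python sketch of that effect, under some assumptions: `word_to_fraction` is a hypothetical helper that reads a word as a base-26 fraction with a=0 through z=25, and the sample is hand-picked to avoid umlauts and ß, which this simple mapping does not handle.

```python
def word_to_fraction(word):
    """Read an a-z word as the base-26 fraction 0.w1w2w3... (a=0, ..., z=25)."""
    return sum((ord(c) - ord("a")) / 26 ** k for k, c in enumerate(word, start=1))

# Hand-picked sample of German number words (no umlauts or ß).
words = {
    21: "einundzwanzig", 41: "einundvierzig", 61: "einundsechzig",
    22: "zweiundzwanzig", 42: "zweiundvierzig", 62: "zweiundsechzig",
    23: "dreiundzwanzig", 43: "dreiundvierzig", 63: "dreiundsechzig",
}

# Group the y-values by the ones digit of each number.
by_ones = {}
for number, word in words.items():
    by_ones.setdefault(number % 10, []).append(word_to_fraction(word))

# Within each group, measure how far apart the y-values are.
spreads = {ones: max(ys) - min(ys) for ones, ys in by_ones.items()}
for ones, ys in sorted(by_ones.items()):
    print(f"ones digit {ones}: y ~ {min(ys):.4f}, spread {spreads[ones]:.1e}")
```

Words that share a ones digit share a long prefix ("einund", "zweiund", "dreiund"), so their y-values are nearly identical. On a plot, each group would sit on what looks like a horizontal line.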

[–] [email protected] 5 points 5 months ago (1 children)

Is there a good way to do this? I am thinking one could (taking English as an example) treat each word as a base-26 number (o.ne, t.wo, t.hree, ...) and divide them by 26 to normalize values between 0 and 1.
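A minimal sketch of that idea in Python, assuming lowercase a-z input only and the digit mapping a=0 through z=25 (both are assumptions, and `word_to_fraction` is a hypothetical name). Reading the first letter as the integer part and then dividing by 26 is the same as reading the whole word as the fraction 0.w1w2w3... in base 26.

```python
def word_to_fraction(word):
    """Read a lowercase a-z word as the base-26 fraction 0.w1w2w3... (a=0, ..., z=25)."""
    value = 0.0
    for position, letter in enumerate(word, start=1):
        digit = ord(letter) - ord("a")
        if not 0 <= digit < 26:
            raise ValueError(f"unsupported letter: {letter!r}")
        # Each letter contributes digit / 26**position, like decimal places in base 26.
        value += digit / 26 ** position
    return value

for word in ["zero", "one", "two", "three"]:
    print(f"{word:>5}  {word_to_fraction(word):.3f}")
```

With this mapping, two and three both land between 0.73 and 0.77, close together as you would expect for two words starting with t. Accented letters and other alphabets would need their own digit mapping.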

[–] [email protected] 3 points 5 months ago

Yes, that’s exactly the way to do it!

[–] [email protected] 3 points 5 months ago

Oh, now I finally see it. I thought they all just had their limits from A to Z, but they are all different. That's just... wrong

[–] [email protected] 3 points 5 months ago

All data points from all series are sorted on the Y-axis relative to one another, not against the external constant of the alphabet. This is contrary to how graphs are most frequently plotted, and it means that the shape of the data can change significantly depending on the size of the dataset. It's not an invalid way of plotting, just unusual, and personally I don't like it.
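A small Python sketch of how the dataset size changes the picture. The word lists are a toy example, and `rank_positions` is a hypothetical helper, not the code behind the original plot.

```python
def rank_positions(words):
    """Map each word to its alphabetical rank, scaled to the range 0.0-1.0."""
    ordered = sorted(words)
    return {w: i / (len(ordered) - 1) for i, w in enumerate(ordered)}

small = ["zero", "one", "two", "three", "four"]
large = small + ["five", "six", "seven", "eight", "nine", "ten"]

before = rank_positions(small)
after = rank_positions(large)

# Compare where the original five words sit before and after extending the data.
for w in small:
    print(f"{w:>5}  {before[w]:.2f} -> {after[w]:.2f}")
```

Here one moves from 0.25 to 0.40 and three from 0.50 to 0.80 just because more words were added, while zero stays at the top. On a fixed A-to-Z scale, none of the existing points would move.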

[–] [email protected] 1 points 5 months ago

Agreed. Proper graphs should be easily interpreted by most people looking at them, without asking a bunch of questions.

This one is a bit too out there. And by "a bit", I mean much too far. This could not be published in a scientific journal. (Although a lot of published graphs aren't great either.)