We actually don't count users, but we count requests to the directories that clients make periodically to update their list of relays and estimate user numbers indirectly from there.
No, but we can see what fraction of directories reported them, and then we can extrapolate the total number in the network.
We put in the assumption that the average client makes 10 such requests per day.
A tor client that is connected 24/7 makes about 15 requests per day, but not all clients are connected 24/7, so we picked the number 10 for the average client.
We simply divide directory requests by 10 and consider the result as the number of users.
Another way of looking at it, is that we assume that each request represents a client that stays online for one tenth of a day, so 2 hours and 24 minutes.
Average number of concurrent users, estimated from data collected over a day. We can't say how many distinct users there are.
No, the relays that report these statistics aggregate requests by country of origin and over a period of 24 hours.
The statistics we would need to gather for the number of users per hour would be too detailed and might put users at risk.
Then we count those users as one. We really count clients, but it's more intuitive for most people to think of users, that's why we say users and not clients.
No, because that user updates their list of relays as often as a user that doesn't change IP address over the day.
The directories resolve IP addresses to country codes and report these numbers in aggregate form. This is one of the reasons why tor ships with a GeoIP database.
Very few bridges report data on transports or IP versions yet, and by default we consider requests to use the default OR protocol and IPv4.
Once more bridges report these data, the numbers will become more accurate.
Relays and bridges report some of the data in 24-hour intervals which may end at any time of the day.
And after such an interval is over relays and bridges might take another 18 hours to report the data.
We cut off the last two days from the graphs, because we want to avoid that the last data point in a graph indicates a recent trend change which is in fact just an artifact of the algorithm.
The reason is that we publish user numbers once we're confident enough that they won't change significantly anymore.
But it's always possible that a directory reports data a few hours after we were confident enough, but which then slightly changed the graph.
We do have descriptor archives from before that time, but those descriptors didn't contain all the data we use to estimate user numbers.
Please find the following tarball for more details:
For direct users, we include all directories which we didn't do in the old approach.
We also use histories that only contain bytes written to answer directory requests, which is more precise than using general byte histories.
Oh, that's a whole different story. We wrote a 13 page long technical report explaining the reasons for retiring the old approach.
tl;dr: in the old approach we measured the wrong thing, and now we measure the right thing.
We run an anomaly-based censorship-detection system that looks at estimated user numbers over a series of days and predicts the user number in the next days.
If the actual number is higher or lower, this might indicate a possible censorship event or release of censorship.
For more details, see our technical report.