Publish Date: Jun 10, 2017    Last Update: Jun 10, 2018


The SEC has made public the server log files for its EDGAR system. The data describes how investors access corporate filings. Currently, it covers all SEC EDGAR website traffic from 2003 to 2017.

Efficient parsing

One problem with processing the data is its sheer size. The raw data is about 4 terabytes, which makes even simple queries time-consuming.

To tackle the problem, I use big data tools to speed up querying. More specifically, I first use Hive to convert the data into a columnar format called Parquet, and then query the data using Spark. With a cluster of machines, complicated queries finish within minutes. For my use, I built the whole infrastructure on Amazon Web Services. If you want to use the above two datasets for academic purposes, or want to know more about the data processing steps, please contact me. I have also provided some pre-processed datasets below for academic use.
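To give a feel for why a columnar layout helps, here is a toy sketch in plain Python. It is not Parquet, Hive, or Spark, and it is not my actual pipeline. It only shows the idea: once rows are pivoted into columns, a query that needs one field touches only that field. The sample rows, the `to_columns` helper, and the field names are made up for illustration.

```python
from collections import Counter

# Made-up sample rows; a real log has many more fields and billions of rows.
rows = [
    {"ip": "10.0.0.abc", "code": 200, "idx": 0, "crawler": 0},
    {"ip": "10.0.0.abd", "code": 404, "idx": 0, "crawler": 0},
    {"ip": "10.0.0.abe", "code": 200, "idx": 1, "crawler": 1},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Counting requests by server code reads only the "code" column;
# the ip, idx, and crawler columns are never touched.
code_counts = Counter(columns["code"])
print(code_counts)
```

A columnar file format applies the same idea on disk, so a query engine can skip the columns a query does not use.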

Summary statistics

The total number of requests in the dataset is immense. The table below gives the total number of observations by year. [1]

Table I - # of observations by year

year # of obs.
2003 52,669,597
2004 172,850,093
2005 94,420,455
2006 136,322,271
2007 222,409,334
2008 251,946,216
2009 639,478,569
2010 1,051,235,018
2011 1,122,581,998
2012 1,752,489,883
2013 2,678,929,691
2014 2,978,979,793
2015 4,057,641,724
2016 6,661,806,472
2017 4,517,785,496

Data cleaning

If the goal is to analyze the activities of human users, you need to take several additional steps to clean the data. First, drop requests that resulted in an error or a redirection by requiring the server code to be strictly smaller than 300. Second, drop requests for index pages (idx = 1); otherwise they lead to double counting. Third, drop requests where the web crawler flag is set (crawler = 1). Of course, not all web crawlers declare themselves as robots; I address this issue in a later section. The tables below summarize the distribution of each of these fields.

Table II - # of requests by server code

code # of obs.
0 145,319,660
200-299 21,539,431,112
300-399 3,282,042,265
400-499 1,325,675,392
500-599 99,039,166
Other 104

Table III - # of requests by idx

idx # of obs.
0 14,332,808,953
1 12,058,699,303
Other 5,796

Table IV - # of requests by crawler

crawler # of obs.
0 25,221,426,225
1 1,170,085,308
Other 5,013

To clean the data, I add the following filter to my SQL queries:

WHERE (code < 300) AND (idx = 0) AND (crawler = 0)

After cleaning the data, 10,956,485,556 observations remain in total from 2013 to June 2017.
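For readers who want to try the filter on a small extract without a cluster, the WHERE clause above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the sample rows are made up, the `ip` and `cik` columns are only there for illustration, and the helper names `is_clean` and `clean_rows` are hypothetical. Values are parsed as floats so that both `200` and `200.0` are accepted, and rows with unparsable flags are dropped.

```python
import csv
import io

# Made-up sample; only code, idx, and crawler are used by the filter.
SAMPLE = """ip,cik,code,idx,crawler
10.0.0.abc,1000180,200,0,0
10.0.0.abd,1000180,301,0,0
10.0.0.abe,1000180,200,1,0
10.0.0.abf,1000180,200,0,1
10.0.0.abg,1000180,404,0,0
10.0.0.abh,1000180,200,x,0
"""


def is_clean(row):
    """Mirror WHERE (code < 300) AND (idx = 0) AND (crawler = 0)."""
    try:
        code, idx, crawler = (float(row[k]) for k in ("code", "idx", "crawler"))
    except (TypeError, ValueError):
        # Malformed values cannot satisfy the filter, so drop the row.
        return False
    return code < 300 and idx == 0 and crawler == 0


def clean_rows(lines):
    """Return the rows of a CSV stream that pass the cleaning filter."""
    return [row for row in csv.DictReader(lines) if is_clean(row)]


kept = clean_rows(io.StringIO(SAMPLE))
print(len(kept), "row(s) kept")
```

In this sample only the first row survives: the others are a redirect, an index-page request, a self-declared crawler, an error, and a row with a malformed idx value.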

Web crawlers

Two sets of criteria are used in the literature to identify and eliminate the remaining web crawlers.

  • Method 1:

    • An IP is a web crawler if it
      • made more than five web requests in a minute, or
      • made more than a thousand web requests in a day
    • Proposed in Drake, Roulstone, and Thornock (2015) [2], also used in Chen, Kelly, and Wu (2018) [3]
  • Method 2:

    • An IP is a web crawler if it requested information for more than 50 firms in a day
    • Proposed in Lee, Ma, and Wang (2015) [4], also examined in Loughran and McDonald (2017) [5]
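The two methods above can be sketched in a few lines of Python. This is only an illustration, not the code used in the cited papers. It assumes that "in a minute" means a calendar-minute bucket rather than a rolling window, that an IP flagged on any single day is flagged overall, and that each request arrives as an `(ip, date, time, cik)` tuple. The function name `flag_crawlers` and the sample data are hypothetical.

```python
from collections import defaultdict


def flag_crawlers(requests, per_minute=5, per_day=1000, firms_per_day=50):
    """Return (method1_ips, method2_ips) for (ip, date, 'HH:MM:SS', cik) tuples."""
    minute_counts = defaultdict(int)  # (ip, date, 'HH:MM') -> number of requests
    day_counts = defaultdict(int)     # (ip, date) -> number of requests
    day_firms = defaultdict(set)      # (ip, date) -> distinct firms requested
    for ip, date, time, cik in requests:
        minute_counts[(ip, date, time[:5])] += 1
        day_counts[(ip, date)] += 1
        day_firms[(ip, date)].add(cik)

    # Method 1: more than 5 requests in a minute, or more than 1,000 in a day.
    method1 = {key[0] for key, n in minute_counts.items() if n > per_minute}
    method1 |= {key[0] for key, n in day_counts.items() if n > per_day}

    # Method 2: more than 50 distinct firms in a day.
    method2 = {key[0] for key, firms in day_firms.items() if len(firms) > firms_per_day}
    return method1, method2


# Made-up sample: ip_a sends 6 requests within one minute for a single firm,
# ip_b requests 51 different firms spread across the day, ip_c sends one request.
sample = (
    [("ip_a", "2017-01-03", "09:30:%02d" % s, "c1") for s in range(6)]
    + [("ip_b", "2017-01-03", "%02d:00:00" % (i % 24), "c%d" % i) for i in range(51)]
    + [("ip_c", "2017-01-03", "10:00:00", "c1")]
)
method1_ips, method2_ips = flag_crawlers(sample)
print(method1_ips, method2_ips)
```

Note that the two methods need not agree: in this sample, ip_a is flagged only by Method 1 and ip_b only by Method 2.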

A common disadvantage of these methods is a possibly large type I error (false positives) in identifying web crawlers. Larger institutions often have more users than public IP addresses, so they use network address translation (NAT) to route web traffic. As a result, many users may share a single outbound IP address, and the above methods might falsely tag these IPs as web crawlers.
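A tiny simulation makes the NAT problem concrete. The scenario is hypothetical: eight analysts at one institution each open a single filing within the same minute, and all of their traffic leaves through one shared IP. The user label exists only in the simulation; the server log sees the IP alone.

```python
from collections import Counter

# Hypothetical scenario: 8 human users behind one NAT gateway,
# each making exactly one request in the same minute.
shared_ip = "nat_gateway"
requests = [(shared_ip, "2017-01-03 09:30", "analyst_%d" % i) for i in range(8)]

# What the server log can see: requests per (ip, minute).
per_minute = Counter((ip, minute) for ip, minute, _user in requests)
flagged = {ip for (ip, _minute), n in per_minute.items() if n > 5}

# What actually happened: requests per human user.
per_user = Counter(user for _ip, _minute, user in requests)

print("flagged IPs:", flagged)
print("max requests by any single user:", max(per_user.values()))
```

The shared IP exceeds the five-requests-per-minute threshold even though no individual user made more than one request, which is exactly the false positive described above.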

Filing types

Top users

Pre-processed dataset

Coming soon.

  1. Note that the data in 2017 only includes months from January to June.
  2. Drake, Michael S., Darren T. Roulstone, and Jacob R. Thornock. 2015. “The Determinants and Consequences of Information Acquisition via EDGAR.” Contemporary Accounting Research 32 (3): 1128–61.
  3. Chen, Yong, Bryan T. Kelly, and Wei Wu. 2018. “Sophisticated Investors and Market Efficiency: Evidence from a Natural Experiment.” Working paper.
  4. Lee, Charles M. C., Paul Ma, and Charles C. Y. Wang. 2015. “Search-Based Peer Firms: Aggregating Investor Perceptions through Internet Co-Searches.” Journal of Financial Economics 116 (2): 410–31.
  5. Loughran, Tim, and Bill McDonald. 2017. “The Use of EDGAR Filings by Investors.” Journal of Behavioral Finance 18 (2): 231–48.