According to the Gartner IT Glossary, “Big Data” is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. This is more or less how the idea of big data was first conceived. Since then the three “V’s” have grown and grown; some organizations have added Veracity, for example, and another has added Vision. There is nothing rigorous or complete built into the “V-system.” These “Vs” refer to, describe, or portray the data sets involved in Big Data. V-words could be added by the many; plenty of attractive ones have not been, e.g., Very, Voluptuous, Vain, Vindictive, Vigorous, and so forth. The same is true of new V-words attached to large data groups that do not fall within the definition or specifications of actually BIG DATA. Real “Big Data,” as opposed to data sets that are simply big, is probably not going to be found in the accumulated files of even a really BIG law firm. It is more likely to be millions upon millions upon millions of “pieces” of data, and sometimes even billions.
Here is how PC Magazine defined Big Data:
Big Data refers to the massive amounts of data collected over time that are difficult to analyze and handle using common database management tools. Big Data includes [enormous and lengthy series of] business transactions, e-mail messages [by the millions, at least], photos [in similar volumes], surveillance videos, and activity logs. Scientific data from sensors can reach truly mammoth proportions over time, and social media are the same sort of thing. [A problem here is guessing at what might count as massive. Most lawyers, among others, could not really think realistically or concretely in such large terms.]

Wikipedia defines Big Data this way:
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

“Difficult” is the wrong word here. Running the Library of Congress was difficult when it was done by hand; now that big data tools are used, the matter is easier, even though massive amounts of data, and therefore information, are collected and analyzed.


The two types of functions are not, as it were, on the same planet. The challenges include capture, collection, storage, search, sharing, transfer, analysis, and visualization.

Not all of these are real problems in every case, or all together. It is not hard for Wal-Mart to collect its sales documents, although designing the collection methods may not have been easy.


The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine the quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

Notice that none of what is said here amounts to asserting that one document, including all the foregoing, is automatically a separate category. Ask yourself this: Is a given document, with its addressees, date, content, author, &c., likely to be contained in one digit or one data point, especially if a discovering lawyer, for example, wants to know who knew what, and from whom? What about a document with a lot of pages? What about a chain of emails?
Here, more or less near the start of this discussion, is a good time to say that the truly extraordinary accomplishments involving big data involve correlations. Answers to simple questions like “Where is Michael today?” or “Where was Michael yesterday?” may not be providable.


Webopedia says this, among many other things:
Big data is a buzzword or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it’s difficult to process using traditional database and software techniques.
IBM defines big data as:
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data. [Here is another way to put this point: the amount of data in sets that count as big data is unimaginably huge.] See Nate Silver, THE SIGNAL AND THE NOISE: WHY SO MANY PREDICTIONS FAIL–BUT SOME DON’T 27-40 (2012). Silver offers a spectacle of predictions based simply on statistics and another form of objectivity.
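IBM’s “90% in the last two years” figure is, at bottom, compounding arithmetic: if the total stock of data grows roughly tenfold every two years (an assumed growth rate, used here purely for illustration), then nine-tenths of it is always less than two years old. A toy check:

```python
# If today's stock of data is 10x what it was two years ago, the share
# created in the last two years is (total - old) / total = 0.9.
old = 1.0           # stock two years ago (arbitrary units)
total = 10.0 * old  # stock today, assuming 10x growth per two years
recent_share = (total - old) / total
print(recent_share)  # 0.9
```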
The ABA said this in a May 2013 publication:
Big data, loosely defined as the computer analysis of torrents of information to find hidden gems of insight, is slowly transforming the way law is practiced in the U.S. [Talk about bullshit; this is a paradigm case.

(1) What is a “torrent” in ordinary English? Relative to “bigness,” can some torrents be quite “small”? What is it a torrent of? What is the significance of the torrent? How fast was the torrent’s flow? Not every torrent implies big data flowing; indeed, most do not. Big data is very special.

It is also worth pointing out, here and now,] that the word “torrent” has at least one special meaning in computer jargon, to wit: “BitTorrent,” the BitTorrent protocol, and a few others. Among other oddball facts, the phraseology is linked to the name “Pirate Bay,” but that is of doubtful significance here. Most important, perhaps, is that a torrent file is relatively small (around a few kilobytes), which plainly has nothing to do with big data. The ABA draws no distinction and doesn’t seem to realize that there are two very distinct usages, meanings, and ideas. Here is a reasonable inference: with respect to the ABA’s outlook, it cannot distinguish “shit from Shinola.”


(2) What about the idea of “gems”? [The ABA metaphor is nonsense. A gem is a relatively small object, like a diamond on a $1M Cartier necklace. Looking for a gem would be like digging around on a beach looking for something quite small; that would work on the hypothesis that it, the single object, might be there. That is not how the study or use of big data works. One is not looking for single objects; one is looking for huge patterns from which (at first) tentative conclusions can be drawn. The process is not a perfect one; it is not necessarily accurate; indeed, it has been called “messy” by some world-leading experts. Better to say that someone mining big data is looking for gold; at least gold comes in nuggets when it is in a river, or very close to or under one.]

In terms of what many lawyers do much of the time, one of the central theses of the ABA’s article is especially absurd. Most lawyers do not themselves get involved with nearly enough documents to describe what they do as mining or manipulating data “piles,” or pursuing correlations in the realm of big data. No, or virtually no, suits involve big data, even if they involve a lot of data: enough, for example, to involve “predictive coding.” Then again, there is probably some business transaction that might; consider a merger of Wal-Mart and Amazon.

In addition, and at least as important, virtually all lawsuits involve looking for causes. The function of analyzing big data is not to determine causation, the why; it is to determine the what, where, and when. Imagine trying any sort of case without trying to determine individual facts (“Did that doctor foul up, and why did he do it?”) and relevant standards (“Did that accountant act in accordance with the applicable standard of care?”). These are not the kinds of analyses and conclusions with which big data is likely to be really helpful to lawyers.

The ABA article talks about using big data to determine what other law firms are charging clients, what kinds of cases are being won or lost, which summary judgments are being granted or denied, and which courts grant what kinds of sanctions. It even says that big data will tell us why a judge refused to grant certain types of motions. On this last point, the ABA’s article goes well below the surface of nonsense.

Information regarding what other law firms are charging clients is probably not big data; there is not enough information. In addition, no one law firm has data of that kind, if it even exists. Moreover, knowing what other law firms are charging is not by itself helpful; the contemplated and recommended analysis would have to concern what is being charged for what. Modest-sized real estate deals are not the same, so far as fees are concerned, as gigantic antitrust cases or huge securities cases. Readers must ask themselves how the correlations might work, and what was being correlated with what.

What matters in the analysis of big data–as everywhere else in the world of big data–is the what; the why is not–and cannot be–the focus. 

(1) Conjectures about the whys may be easier to work with if one has sufficient actual data.

(2) Patterns of what large business clients (and small ones, for that matter) want are easy for lawyers to figure out: “Ask them,” for one thing. This does not take big data, and big data will certainly not solve “why” questions.

(3) Only the area of gargantuan discovery, a very, very rare area, is the homeland of big data. I just read 20,000 documents; this was not big data, or anything even close to it.

(4) There is even less data for determining what judges are doing. Besides, judges have to be divided into subcategories: What jurisdiction? What issues? And so forth. Big data will not reveal why a judge is making the decisions he or she is making. Here is a big-data-style question: How many times does the word “idiot” appear in Texas appellate cases? Of course, that information can be retrieved from two big legal data archives.
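For what it is worth, the “idiot” query is an ordinary full-text count, not a big data problem; once the opinions are in hand, a few lines do it. The snippets below are hypothetical stand-ins for a corpus of opinions:

```python
import re

# Hypothetical snippets standing in for a corpus of appellate opinions.
opinions = [
    "Counsel's argument, the court noted, was not that of an idiot.",
    "The term 'idiot' appears nowhere in the statute.",
    "Summary judgment granted.",
]

# Count whole-word, case-insensitive occurrences across the corpus.
count = sum(len(re.findall(r"\bidiot\b", op, re.IGNORECASE)) for op in opinions)
print(count)  # 2
```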

The data quantities that “move” in big data circles are measured in powers of 1024: 1024 to the first, second, third, fourth power, and so on. The powers have separate names: starting with bits, there are bytes, kilobytes, megabytes, gigabytes, and so forth, and the numbers go up with amazing “speed.” The reader might wish to square 1024, then cube it, then keep going up the steep incline to big data, amounts “galaxies” larger than those involved in “predictive coding.” I doubt that any law firm doing business work in the United States, even one with more than 60 lawyers or a large total staff, comes anywhere close to big data, or even remotely near those kinds of numbers. I doubt that the total number of lawyers in the U.S. is enough, as a general rule, to generate enough data a year to enter the category of big data, or at any rate to enter that class and generate useful commercial information.
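The ladder of unit names is just repeated multiplication by 1024 (2 to the 10th). A minimal sketch of the squaring-and-cubing exercise suggested above:

```python
# Each named unit is 1024 times the previous one (binary convention).
SCALE = 1024
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte"]

def bytes_in(unit: str) -> int:
    """Number of bytes in one of the named units."""
    return SCALE ** UNITS.index(unit)

# 1024 squared, then cubed, as the text suggests:
print(bytes_in("megabyte"))  # 1048576
print(bytes_in("gigabyte"))  # 1073741824
```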

The ABA is not the worst offender at producing bullshit, so far as law firms are concerned, though its view is god-awful. Consider a remark by Sol Irvine, the author. Until recently, he says, presumably meaning before the rise of big data, lawyers were essentially “glorified librarians.” Few statements about what lawyers do could be more mistaken.

If one considers the following list, it is easy to see concretely why law firms are not situated in the big data “marketplace.” I am also inclined to wonder whether the reviews universities do of their online courses, as used by students, are within reach of actual big data treasure troves. Marc Parry, Big Data on Campus, NYT, July 18, 2012.

  • Google’s tracking of the swine flu epidemic,
  • predicting airline ticket price fluctuations and timing purchases,
  • Google Maps,
  • Wal-Mart daily sales
  • Amazon’s sales
  • same for Target
  • records of national and international bank transactions (e.g., debit card use),
  • construction of worldwide interconnected transaction records on a nearly real-time basis,
  • Visa daily tracking
  • ZestFinance studies,
  • Aviva, Prudential, and AIG premium calculations,
  • immunizations
  • premature-baby monitoring, built on a huge amount of data (from relatively few patients),
  • Sloan Digital Sky Survey (in contrast)
  • UPS truck repair timing
  • manhole-cover explosion problems (relatively few manholes, 94,000 or so, but much else measured),
  • Google’s book collecting
  • Kindle
  • GPS
  • spell checkers and their expansions (Google and Microsoft)
  • the U.S. government’s checking on calls, etc., among citizens,
  • the U.S. government’s checking of incoming calls, emails, and who knows what else,
  • And so forth

Of course, this is an incomplete list. It is also obvious that law firms are almost never involved in collecting this kind of data. Most of the ideas in the foregoing list have been extracted from Viktor Mayer-Schönberger and Kenneth Cukier, BIG DATA: A REVOLUTION THAT WILL TRANSFORM HOW WE LIVE, WORK, AND THINK (2013).


The first listed author is Professor of Internet Governance and Regulation at the Oxford Internet Institute, Oxford University, and the second is a journalistic commentator for several world-respected newspapers and journals. He is now the data editor at The Economist, perhaps the world’s leading economics and finance (among other things) weekly. Some of the ideas explored in this blog come from this very helpful source.

The unhelpful literature on big data, really advertisements from law firms and from vendors to law firms, continues. In addition, most of these ads either conceal or don’t know the difference between “lots of data,” as in pricing history and current pricing with 1,000 clients, and “big data.”

Of course, there is another category: pricing history with 5,000 former clients, a useless enterprise if ever there was one for determining future pricing based on then-current market conditions and business sociology. Other ads come from vendors of big data services, although they do not explain what they propose to deliver.


One of the vendors went so far as to suggest that the institution of e-discovery in the rules of civil procedure is one of the principal causes of the way big data could (and should) be used by law firms “today and tomorrow.” That a proposition something like the ABA’s could have been published is remarkable.

The only one I can recall is a vendor of big-data-based services that admitted 100M documents were a lot for it to handle. Another, at least impliedly, indicates that law firms are not likely to be able to do this by themselves and should form coalitions to hire outside vendors.

This is in contrast to an ad published by a law firm (while trying not to look like one), which says that any law firm, among the many, that does not engage in digitizing and “data-fying” its own files will have real trouble as a law firm. “Real trouble” in the arena of business, even the business of professionals, in the end means “serious money [or profit] troubles.” Look at what happened to the formerly celebrated Dewey firm.

In a whole slew of magazine pieces, there are law firms, always named, obviously seeking business, so far as I can tell, but not actually talking about “big data” in their texts, except in the headline of the ad. They certainly do not explain big data at all. Perhaps it is better this way, since it is reasonable to infer that they have no real idea what they are talking about.

An important, relatively new practice is the electronic storage of information. Clouds are a big deal; law firms use them extensively, they have contributed to new discovery rules in civil cases, and they are being used more and more. There are a number of ways to handle the resulting volumes, and in large cases a court recently approved “predictive coding” as a sound methodology, at least in some cases. The Magistrate Judge’s opinion in Moore v. Publicis Groupe & MSL Group, 287 F.R.D. 182 (S.D.N.Y. 2012), is helpful about the lingo and about how conflicting sides can get a relevant job done. It is not, however, about the contrast between the object of predictive coding and the use of big data. In fact, Judge Peck avoids the phrase “big data” and speaks only of “large data,” i.e., a data set that is large. That is either fashion or some sort of misunderstanding of the language and concepts of big data.

Interestingly, there were 3M pieces of data in the case before the job began, but, it appears, mostly email. In addition, in the grand scheme of big data problems, when that phrase is used correctly, 3M is not actually that large.

Perhaps both of these explanations apply, since the locution “big data” is now a “buzz phrase,” and therefore almost certainly, at best, misused, misunderstood, set forth incoherently given the proximate science, and false.

Advocates of conclusions based on “buzz phraseology” should be thought of as epistemological “buzz-ards” and therefore not worth listening to. In any current culture, when a buzz phrase is closely connected (or what might be called “ontologically” connected) to any concept or words, the connection always gets things wrong, though it may get a little bit of its “picture” almost right.

It may reasonably be conjectured that Big Data, if correctly and adequately conceived, will play very little (if even that much) of a role in litigation. (1) The amounts of data are way, way too large for the purposes and sizes of most lawsuit discovery, not to mention the amounts of money required and available to do the work. (2) In addition, in “predictive coding” discovery methodology, the discovery technique must start with a set of alterable hypotheses specified (or approved) in advance by the participating lawyers, perhaps with technical assistance, as in: “These terms, x, y, z, etc., will provide valuable guidance to important documents.” Those terms are then run through the database containing possibly relevant documents to look for key terms, which helps reduce the number of documents that need to be retrieved, at which point the parties either (a) accept the search list, or (b) contest it, probably with arguments, expert witnesses, and/or concrete alternatives.

Successive searches may be conducted to provide a reasonable reduction in the available hoard of documents. Then the very large “stack” may be searched again and again, probably with an expanded list. Or maybe the list will not be expanded; perhaps some terms will be eliminated and others substituted. The items on the discover-the-really-relevant-documents lists are not random compilations.
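The iterative term-list loop described above can be sketched in a few lines. This is a bare illustration of the process, not any vendor’s actual product, and the documents and terms below are hypothetical:

```python
# One pass of keyword filtering: lawyers propose terms, the terms are
# run against the document hoard, and the list is revised between passes.
def search_pass(documents, terms):
    """Return the documents containing at least one of the terms."""
    terms = [t.lower() for t in terms]
    return [d for d in documents if any(t in d.lower() for t in terms)]

# Hypothetical documents, purely for illustration.
hoard = [
    "Merger pricing memo for the Q3 antitrust review",
    "Lunch schedule for the Houston office",
    "Email chain re: securities filing deadlines",
]

first_pass = search_pass(hoard, ["merger", "securities"])
# The parties accept or contest the list, revise the terms, and re-run:
second_pass = search_pass(first_pass, ["antitrust"])
```

Each pass narrows the stack; substituting or dropping terms between passes is just calling `search_pass` again with a different list.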

There is an old saying:
This blog post violates that rule.  Its axiom is apparently: SOMETIMES AUTHORS THINK READERS NEED MORE THAN ENOUGH.
Sometimes, readers might be right.