24 December 2011

Head In The Cloud (or Somewhere)

Came across this posting whilst surfing, with the following:

"For example, one large package deliverer sees a 400% increase in its network traffic, database needs, and computing power over a single 45 day window. Utilizing a service such as Azure will allow them to pay for that excess capacity only in the 45 days it is needed, not the other 320 days a year when it is not utilized, but must be maintained, upgraded, and licensed."

Why do folks persist in believing that there is a Free Lunch up in The Cloud? The notion that seasonal/repetitive demand spikes for IT resources are uniformly distributed over a time period (day/week/year) is just silly. Yes, as a loss leader, a Cloud Vendor may choose to not add a Load Factor Penalty. In the beginning. But, that's not sustainable, because demand spikes aren't uniformly distributed. Cloud vendors will make the clients pay for all that idle storage and cpu (and a nice profit on the idle riches), you betcha. That package deliverer experienced the same spike as all of the retail chain. And all of the energy vendors. And so on. It's Econ. 101, folks.
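To put rough numbers on the load factor point, here is a minimal sketch (in Python, purely for illustration). Everything in it is an assumption: three hypothetical customers, each idling at 1 unit of capacity and spiking to 5 units (the quoted 400% increase) for 45 days. The only question asked is how much of the vendor's peak capacity is busy on an average day when the spikes are spread out versus when they all land in the same window.

```python
# Hypothetical numbers: each customer runs at a baseline of 1 unit of capacity
# and spikes to 5 units (a 400% increase) for 45 days a year.
DAYS = 365
SPIKE_DAYS = 45
BASE, PEAK = 1.0, 5.0

def yearly_profile(spike_start):
    """Daily demand for one customer whose spike starts on day `spike_start`."""
    return [PEAK if spike_start <= d < spike_start + SPIKE_DAYS else BASE
            for d in range(DAYS)]

def load_factor(profiles):
    """Average aggregate demand divided by the capacity needed at the peak."""
    total = [sum(day) for day in zip(*profiles)]
    return (sum(total) / DAYS) / max(total)

# Spikes staggered across the year vs. all landing in the same holiday window.
staggered = [yearly_profile(s) for s in (0, 100, 200)]
coincident = [yearly_profile(300) for _ in range(3)]

lf_staggered = load_factor(staggered)
lf_coincident = load_factor(coincident)
print(f"load factor, staggered spikes:  {lf_staggered:.2f}")
print(f"load factor, coincident spikes: {lf_coincident:.2f}")
```

With these made-up figures the vendor's iron is busy roughly 64% of the time when spikes are staggered, and roughly 30% when they coincide. Somebody pays for the other 70%.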

21 December 2011

Icarus

The sky is falling. Oracle reported down, a bit, but, in particular, didn't report above expectations. Larry has been sly for a long time, in setting guidance low enough that bettering it is a piece of cake. Not this time. As I type, it's down 15%, and news is that much of the tech sector is getting the flu.

The knee jerk reaction: mortgage the farm and buy Oracle stock. IIIIIIIIIIII'm not so sure this time. Here's why.

Oracle didn't get quite the *new* software sales and *new* hardware sales. The latter is, by all accounts, due to customers waiting on the new machines during the quarter. The former is more speculative. The reports are vague. My take: given the aggressive pricing of Oracle RDBMS, ditto for MySql (yes, it's GPL, but Oracle blasted its support prices into the sky in the past year), and putting the screws to Java adopters; folks are looking for a safer port.

On the RDBMS side, Postgres gets ever closer to Oracle, if you're not a Fortune X00 company (and even if you are, and building apps off the mission critical axis). Mainstream pundits are crying that "Da Cloud, boss, Da Cloud" is putting Oracle in an untenable position. It is said that cloud providers use dirt cheap components, soft and hard, and Oracle's RDBMS and Sun-ish machines are just too expensive Up There. As if being cheap were the best way to make money!? "Cheap goods sold dear" is an aphorism that's been around forever. The Cloud is shaping up that way, and if so, I'd avoid Fortune X00 companies that chose to put *my data* Up There. In this nascent era of Cloud, too many stories of wandering data to suit me.

What's really stupid about Larry's ploy: the Oracle RDBMS is built on an engine (the piece that actually does all the inserting and updating) which uses Multiversion Concurrency Control (MVCC, as it is known) which is better suited to the asynchronous nature of the Web/Cloud than the locker paradigm that most other (notably, not Postgres) RDBMS have been using for decades. They've been backing in, so to speak, MVCC support recently, but none is a true MVCC database. In other words, Larry has the proper mousetrap for the setting, but has managed to offend his customers. But, that's Larry's way.

Reports say that Oracle claims the shortfall is due to last minute non-signings. If so, then this is an aberrant glitch. Given that Fortune X00 companies are sitting on, by some accounts, more than $3 trillion, there's no macro reason to not buy new IT. Unless you're a Fat Man yearning for Famine.

20 December 2011

A Warren-ted Search

One of the points "for further research", as I used to say when I was an academic, in the Triage exercise was using social media to measure outcomes. R has a library, twitteR (yes, R folks tend to capitalize the letter at every opportunity), which retrieves some data. I was at first uninterested, since I don't have a twitter account. Thankfully, twits can be gotten without being a twitterer. Since Elizabeth Warren's campaign is just over the border, and sort of important in the grand scheme of things, I've been exploring.

Here's the entirety of the R code (as seen in an Rstudio session) needed to return the twits (1,500 is the max, which will prove troublesome when the battle is fully engaged):

> library(twitteR)
> warrenTweets <- searchTwitter('@elizabethwarren', n = 1500)
> length(warrenTweets)
[1] 9
> warren.Text <- laply(warrenTweets, function(t) t$getText())
> head(warren.Text, 10)
[1] "@elizabethwarren i hope you win agianst sen scott brown. the 99% r with u"
[2] "@elizabethwarren More $$$ coming your way!"
[3] "#HR3505 PAGING: @ElizabethWarren Help us!!!!"
[4] "@elizabethwarren - not to worry, the only job Karl Rove ever got somebody was George W. Bush. and look how that turned out."
[5] "RT @SenatorBuono: What an amazing turnout 4 a superstar. @elizabethwarren"
[6] "HELLO @ElizabethWarren ! PLEASE RUN as a 3rd party or Ind. FOR POTUS2012. Dems just threwSENIORS underthebus for the working tax cut! EXdem"
[7] "@chucktodd We hope 2011 will be remembered for something a LOT closer to home. #ows #OccupyWallStreet @ElizabethWarren #WARREN/PELOSI-2016"
[8] "RT @SenatorBuono: What an amazing turnout 4 a superstar. @elizabethwarren"
[9] "What an amazing turnout 4 a superstar. @elizabethwarren"


The lines starting with > are the R code. The lines starting with [x] are the output. Here we have 9 twits.

Now, what do we do with the text? For that, I'll send you off to this presentation which came up in my R/twitter search (and is the source of what you've seen here), conducted in Boston. Missed it, dang. Starting at slide 11 is the explanation of how one might parse the twits looking for positive/negative response. By the way, even if you're not the least bit interested in such nonsense, visit slide 29.
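For a feel of what "parse the twits looking for positive/negative response" can mean at its crudest, here is a minimal sketch in Python rather than R. It is not the presentation's code: the word lists are tiny and made up, and the `score` function is a hypothetical helper that just counts positive words minus negative words.

```python
import re

# Tiny, made-up word lists; a real exercise would use a published opinion lexicon.
POSITIVE = {"amazing", "superstar", "hope", "win"}
NEGATIVE = {"worry", "lose", "bad"}

def score(text):
    """Count of positive words minus count of negative words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Three of the twits returned above.
tweets = [
    "What an amazing turnout 4 a superstar. @elizabethwarren",
    "@elizabethwarren - not to worry, the only job Karl Rove ever got "
    "somebody was George W. Bush. and look how that turned out.",
    "@elizabethwarren More $$$ coming your way!",
]
scores = [score(t) for t in tweets]
for s, t in zip(scores, tweets):
    print(s, t[:50])
```

Note that the second twit scores negative on the strength of "worry", although "not to worry" is plainly friendly. Word counting is cheap, and it shows.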

As I mentioned in Triage and follow-ups, getting the outcomes data is the largest piece of the work. Simply being able to "guarantee" the accuracy of twitter (or any other uncontrolled source) data, given the restriction on returned twits and such, will require some level of data sophistication; which your average Apparatchik likely doesn't care about. The goal, I'll mention again, isn't to emulate Chris Farley's Matt Foley and pump up a candidate no matter what the data say, but to find the candidate out of many most likely to win given some help. Whether Triage would be useful to a single candidate; well, that depends on the inner strength of the candidate.

19 December 2011

Ya Can't Get Theah From Heah

I'm working my way through Wilkinson's graphics book, and right there, on page 404 (could there possibly be a better page for this?) is this:
"The price paid for this efficiency is the loss of flexibility. We cannot examine relations not represented in the hierarchies. Navigation through OLAPs is quite difficult, which is why so many graphical models for representing them have been proposed."

Now, this is a graphics coding guy; he created SYSTAT. He's not, so far as I know, an RM guy. But he gets it.

16 December 2011

Lies, Damn Lies, and Statistics

The Other-R folks have posted a recent entry which references an EMC paper (here if you follow the breadcrumbs) on the state of Data Analysis and Business Intelligence, from the point of view of practitioners. The blog post makes some useful points, but misses some.

I'm referring to the graph in the original post, which is on page 3 (in my screen) of the EMC paper.

What this graph tells me, mostly, is that BI folks are still tied to MS, Excel in particular. Data analysts, not so much; although they'll be tied to corporate policy in such venues.

A few words about each.

Data Storage: SQL Server is tops, which means that most folks, in both camps, are tied to corporate group level machines, not the Big Iron. It's been that way for decades; the analysts have to extract from the Big Iron, and crunch on their own PCs. The categories Other SQL, Netezza, and Greenplum leave room for the Triage with PL/R approach, since the latter are explicitly Postgres and Other SQL is likely as much Postgres as MySql (yuck!). The category is, possibly, misleading if one jumps to the conclusion that companies are MS centric with their data.

Data Management: No real surprise here. Excel is the tool of choice. Way back when I was teaching PC software courses, 1-2-3 was the spreadsheet of choice and all data went through it, and Excel inherited the mindset that a spreadsheet was sophisticated analysis. It is a bit unnerving to realize that so much of what corporations decide is supported by such drek. Note: the BI folks, in the past executive assistants and "secretaries", still use spreadsheets a lot. The Data folks, the other way round. There is small comfort in that. The presence of BASH (or Korn or ...) and AWK (Python and Perl too, but not quite so much; each has bespoke language I/O in the mix) is interesting, in that it means that a fair amount of data is clear text ASCII files. Think about that for a second.

Data Analysis: Clearly, the Data folks use stat packs while the BI folks mostly don't. SAS and SPSS and Stata leading says that the EMC client base is largely large corporate, which isn't a surprise. What is a surprise is the absence of Excel. On the other hand, in the original paper is this (next by each the graph): "While most BI professionals do their *analysis* and data processing in Excel, data science professionals are using SQL, advanced statistical packages...", which corresponds to my experience (emphasis mine).

Data Visualization: The absence of R is suspect, as any R user would understand.

And, finally, this has nothing to do with Big Data, in any case. BD is just another attempt to money-spin by those with an agenda. Janert, in his book "Data Analysis...", makes clear that BD isn't worth the trouble (my inference). The point being that population data, which is what BD offers, is just descriptive stats, and smart data folks aren't interested in descriptive stats. Sports fans, well yeah.

13 December 2011

Jumpin Jack Flash [UPDATE]

In today's posting of the Momentus XT review, Anand (nearly) ends it with this:
"Longer term it's unclear to me whether hybrid drives like the Momentus XT will fill the gap left by SSDs or if software based caching technologies combined with NAND on motherboards will be the preferred route."

Oddly, I wondered much the same a few years back when Sun (pre-Oracle) announced its flash "appliance". That, of course, was for near-mainframe level servers, but was an early assault on the flash-as-HDD approach to using flash. There've been discussions, particularly in the context of PCIe (Fusion-io, of course) "drives", about whether there need be this SSD-as-HDD approach. More folks are talking about storage now as just flash, pretty much, directly wired to the cpu and/or memory manager. As Linus said all those years ago, it will change how file systems are built; if we bother with them at all.

[UPDATE]
Not to mention, this tickled a lower brain stem memory: WinFS. Go spend some time with the Wikipedia article. Mayhaps SSD will be the crutch needed?

11 December 2011

Parallax View

"The Parallax View" was a fun movie, especially for those with a conspiracy slant on the world. One might even see it as a harbinger of "Robocop". Well, it turns out that there is at least one other take on the scatterplot matrix. In the context of the authors' data, it makes some sense.

I tried it with the Triage database table, but am not so convinced it helps for the discrete data proposed. Have a look, and see what you think:

CREATE OR REPLACE FUNCTION "public"."pairs_graph" () RETURNS text AS
$BODY$
library(ggplot2)
library(GGally)
X11(display=':5');
events <- pg.spi.exec ("select period, event, amount, choice, guncontrol, outcome from public.events where candidate = 'Doe' ");
png('ggpairs_graph.png');
events$gcfactor <- as.factor(events$guncontrol)
p <- ggpairs(events, columns=c("period", "amount", "choice", "gcfactor"),
    diag=list(continuous="density",   discrete="bar"), axisLabels="show")
print(p)
dev.off();
print('done');
$BODY$
LANGUAGE 'plr'

And here's the graph:

06 December 2011

Smaller, But Not Worse

Just when the future looks most bleak, comes this news, today. IMFT announces that 20nm flash is in production, and that (look at the pretty picture) 128GB will soon be at your fingertips. The industry may just turn around and do the faster tape waltz. On the other hand, with erase cycles *not* deteriorating, 5NF for large systems is in reach. Perhaps, just maybe, The Yellow Brick Road isn't the Road to Perdition (Jude was creepy).

29 November 2011

What's The Difference

To continue with the Triage project, I've spent a day or two with more graphics texts (about which I'll be musing anon), and getting more familiar with the mapping scenarios.

Separate from the scatterplot matrix data shown in Triage, which would be used to measure the micro components of a campaign, is the question of displaying national trend, twixt Us-uns and Them-uns. For that one turns to map graphics, which is a whole other world. Still in R, mind, but not statistical in nature.

What I have recently found is this site, which replicates a US map with 2004 election results. Now, our Apparatchiks won't be downloading zip files from outside sources, of course. On the other hand, the files make for a perfect dive board for the PoC. Load them into PG, swapping Republican for Bush and Democrat for Kerry and Other for Nader (that's not much of a stretch!). Just for completeness, I'd found much earlier (but can't find that I'd cited), this map exercise, but as of now, the author has been too embarrassed to post the R that does it. While only some form of income data (not specified), it is a follow-on (linked to) to an election stream map set, also not supplied with the R that made it. Nevertheless, one can conclude that with enough time, this is a task suited to R. As mentioned in an earlier post, the animation bits are likely via googleVis.

I'll be using his data, since it provides a basis and I don't have to concoct some, though not the R he used (still using the stock R from Wickham). It's not clear how the numbers were derived.

What is really useful about the 2004 map posting is the data source: a county level count. Get these into a PG table, and we have a surrogate for data which our Apparatchiks would have, and which we can further expand with relatively simple SQL; just to see how a map would change. The notion for this part of the Triage effort is to measure the effect of national campaign spending, post some event/ad/debate/foo, at the POTUS/party level; a RNC/DNC (or 501/527/foo group) view of the country.

Here's the new PG table where we load:

CREATE TABLE public.election (
state varchar(25) NULL,
county varchar(25) NULL,
tot_precincts int4 NULL,
precincts_reporting int4 NULL,
republican int4 NULL,
democrat int4 NULL,
other int4 NULL,
constraint pk_election unique(state, county)
)
WITHOUT OIDS
TABLESPACE pg_default


And we get it loaded thus (concated from the state/county files in the zip):

copy public.election from '/databases/rawdata/2004election/output.txt' using delimiters ';' csv header

Note that column names are underscored, rather than camelCase, since PG forces quoting to use anything in the database if there are Caps in names. Yuck.

And here's the PG + PL/R (I've left it as is; comment/uncomment to generate each of the maps, this is the difference map, shown last. The first set are for the two event maps, while the other is for the diff map):



CREATE OR REPLACE FUNCTION "public"."us_graph" () RETURNS text AS
$BODY$
X11(display=':5');
pdf('US_graph_diff.pdf');
library(maps)
library(plyr)
library(proto)
library(reshape)
library(grid)
library(ggplot2)
library(mapproj)
states <- map_data("state")
#elections <- pg.spi.exec ('select state, sum(republican) as "Republican", sum(democrat) as "Democrat" from election where event_number = 2 group by state order by state');
elections <- pg.spi.exec ('SELECT a.state, sum(a.republican - (SELECT b.republican FROM election b WHERE b.event_number = a.event_number - 1 and a.state = b.state and a.county = b.county)) as Republican FROM election a where a.event_number = 2 group by a.state ORDER BY a.state ')
elections$state <- tolower(elections$state)
elections$republican <- elections$republican/10000
choro <- merge(states, elections, sort = FALSE, by.x = "region", by.y = "state")
choro <- choro[order(choro$order), ]
#p <- qplot(long, lat, data = choro, group = group, fill = Republican / Democrat, geom="polygon", asp=.6)
p <- qplot(long, lat, data = choro, group = group, fill = republican, geom="polygon", asp=.6, main = "Poll Shift", xlab = "", ylab = "")
p + labs(y = "", x = "")
p + opts(panel.grid.major=theme_blank(), panel.grid.minor=theme_blank(), panel.background=theme_blank(), axis.ticks=theme_blank())
p + scale_x_continuous("")
p + scale_y_continuous("") + coord_map()
p + opts(axis.text.x = theme_blank(),axis.text.y = theme_blank(), axis.title.x = theme_blank(), axis.title.y = theme_blank(), axis.tick.length = unit(0, "cm"), axis.ticks.margin = unit(0, "cm"))
p + scale_fill_gradient(limits = c(0, 90))
print(p)
dev.off();
print('done');
$BODY$
LANGUAGE 'plr'


All that spinach for the library calls got eliminated by making an .Rprofile in postgres user's home with the following line:

.libPaths("/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")

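You could also call out the libraries explicitly; both ways work. The additional spinach is various directions to eliminate the lat/long grid on the maps. None work! One likely culprit: each bare `p + ...` line in the function builds a modified plot and then throws it away, since the result is never assigned back to p, so print(p) draws the original plot.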


Here's the Event 1 map:


Now, let's update the table to include an event_number (easier than using a date, anyway) and an event_type. That way, we can generate maps in sequence, but also note what sort of event just/last happened. We could also generate maps sequences for only certain sorts of events (they'd be in a check constraint).

So, let's make some new data:

insert into election (select state, county, tot_precincts, precincts_reporting, republican * .8, democrat * 1.2, other, 2, 'foo' from election where event_number = 1);

We wouldn't get such dramatic shifts (modulo Swift Boats) in the real world, but this is PoC territory.
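The table change itself isn't shown, so here is a minimal sketch of one way it could look, run against an in-memory SQLite database from Python as a stand-in for PG. The column types, the list of allowed event types, and the county counts are all assumptions, and the precinct columns are left out for brevity. One thing worth noticing: the original unique(state, county) constraint would reject the second event's rows, so the key in this sketch is widened to include event_number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PG database
conn.execute("""
    CREATE TABLE election (
        state TEXT, county TEXT,
        republican INTEGER, democrat INTEGER, other INTEGER,
        event_number INTEGER NOT NULL DEFAULT 1,
        event_type TEXT CHECK (event_type IN ('baseline', 'ad', 'debate', 'foo')),
        UNIQUE (state, county, event_number)
    )""")

# Made-up county counts standing in for the loaded file (event 1).
conn.executemany(
    "INSERT INTO election (state, county, republican, democrat, other, event_type) "
    "VALUES (?, ?, ?, ?, ?, 'baseline')",
    [("Texas", "Travis", 1000, 900, 10), ("Vermont", "Addison", 200, 300, 5)],
)

# Same shape as the insert in the post: copy event 1 forward with shifted counts.
conn.execute("""
    INSERT INTO election
    SELECT state, county,
           CAST(ROUND(republican * 0.8) AS INTEGER),
           CAST(ROUND(democrat * 1.2) AS INTEGER),
           other, 2, 'foo'
    FROM election WHERE event_number = 1""")

rows = conn.execute(
    "SELECT state, event_number, republican, democrat FROM election "
    "ORDER BY state, event_number").fetchall()
for row in rows:
    print(row)
```

In PG proper this would be a pair of ALTER TABLE statements against the existing table rather than a fresh CREATE; the sketch only shows the end state.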


This yields a new Event 2 map:


I'm still grappling with my main wish list item: showing the changes in the colors. As it stands, each map takes the full gamut, leaving the legend to display the shifts; doesn't do that all that well. Viewed another way, why not show the delta of polling strength (vote displays are a bit late, after all)? We can do that with a single map. How to get the data out of the election table? For that a correlated subquery is sufficient. It's that big SQL statement.


Here's what the delta map looks like:


What we see is the shift, in absolute, not relative, numbers. So Texas looks to be more Democrat from Event 1 to Event 2 just because it started with more votes; same with California.
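The absolute versus relative point can be seen with two made-up states, again using SQLite from Python as a stand-in for PG. The query keeps the correlated subquery shape of the big SQL statement; the table contents and the `relative` calculation are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PG database
conn.execute("CREATE TABLE election (state TEXT, county TEXT, "
             "republican INTEGER, event_number INTEGER)")
# Made-up counts: both states lose 20% of their Republican count.
conn.executemany("INSERT INTO election VALUES (?, ?, ?, ?)", [
    ("Texas", "A", 100000, 1), ("Texas", "A", 80000, 2),
    ("Vermont", "B", 1000, 1), ("Vermont", "B", 800, 2),
])

# The prior event's count for the same state/county, as a correlated subquery.
prior = """(SELECT b.republican FROM election b
            WHERE b.event_number = a.event_number - 1
              AND b.state = a.state AND b.county = a.county)"""

delta_sql = f"""
    SELECT a.state,
           SUM(a.republican - {prior}) AS abs_shift,
           SUM({prior}) AS base
    FROM election a
    WHERE a.event_number = 2
    GROUP BY a.state
    ORDER BY a.state"""

shifts = conn.execute(delta_sql).fetchall()
relative = {state: abs_shift / base for state, abs_shift, base in shifts}
for state, abs_shift, base in shifts:
    print(state, abs_shift, relative[state])
```

Both states moved by the same 20%, but Texas would dominate a map colored by abs_shift simply because it started with a hundred times the count. Filling by the relative figure would put the two on the same footing.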

Getting rid of the lat/long grid is still a problem, but then, this is a free PoC. Cheap at half the price.

15 November 2011

The Red and the Blue

I was going to build a US map (using R facilities) showing net federal funds at the state level, but found there are a colossal number of these already. No need to demonstrate that yet again. The point would be to demonstrate doing so within the RDBMS, following in the Triage piece's footsteps. I'll just show the code to generate a US map, as shown by Wickham's book.


CREATE OR REPLACE FUNCTION "public"."test_graph" () RETURNS text AS
$BODY$
X11(display=':5');
pdf('test_graph.pdf');
library(maps, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(plyr, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(proto, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(reshape, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(grid, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.13/")
library(ggplot2, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
states <- map_data("state")
arrests <- USArrests
names(arrests) <- tolower(names(arrests)) 
arrests$region <- tolower(rownames(USArrests)) 
choro <- merge(states, arrests, by = "region")
choro <- choro[order(choro$order), ] 
print(qplot(long, lat, data = choro, group = group, fill = assault, geom = "polygon", asp = .6) + borders("state", size = .5)) 
dev.off(); 
print('done'); 
$BODY$ 
LANGUAGE 'plr'




Rather more text than the Triage demonstration. R is built on a multi-user model, but is normally used as a standalone application on a PC. And then, there's the *nix issue. The upshot is that nothing need be done to use "base" modules, and those include the scatterplot matrix in the Triage piece. As mentioned in the piece, R supports (at least) two other graphics engines: lattice and ggplot2. Lattice is an extension of base graphics, while ggplot2 is an implementation of a grammar based graphics engine. This Grammar of Graphics is documented (but not as a code base) in Wilkinson's book; at 712 pages and no code, we'll see (just Amazoned it)!

This map was created with ggplot2 functions, although no database data is used. It is necessary to call out each package/library explicitly, as well; PL/R doesn't know to load dependent packages, alas. In the context of the Triage piece, the application would show the net position of the party's candidates by state, along a Blue/Red vector. Just so happens that the R installation includes some state level data, which Wickham uses illustratively. One might extrapolate that Red States are more violent than Blue States, on the whole. Not that I'm making such an extrapolation, of course.

Loading of non-base libraries can be done in one of two ways: if the R engine library directory has global write permission (not normally so under *nix) then any package (which is then called a library in use, yeah, I know) loaded by any user goes to the directory and can be referred to directly by PL/R; on the other hand in usual installs, each user has packages installed to a local directory. Since postgres (the engine) runs as postgres the user, the packages need to be installed by postgres (the user) from an R session. In a corporate (and political campaigns are very much so) environment, standards and conventions would need to be established.

Ideally, what I'd want, following on the thesis of the Triage piece, is a clickable map (states), but that gets into non-rectangular html buttons (Google maps, I'd wager); not a topic I'm conversant with, yet. Whether it would make sense to generate the map in R with the cruft needed to implement the button logic is another puzzle. I think not, but not sure; R doesn't impress me as a strong string manipulation language. Ideally, then, the map would not only be generated by R, but each state would be a button, which would call a second R function in postgres to show the county/municipal/zip map. Could be a bit of work, but your candidates are worth it.

Here's the picture (this is a png, since Blogger won't chew pdf):

09 November 2011

Honesty in Government

[UPDATE] -- copied Sales the first time, same issue.

As I transition into data scientist, which means re-adding my stats mojo to my RDBMS mojo (not replacing the latter with the former, by the way), I've come across more than a few postings and writings in the statosphere about truth in data. The writing is always by data professionals (not lobbyists and the like, near as I can tell), and the point is always that the data is truth. By truth one means the most accurate picture of the real world, unadorned by propaganda.

Today's Federal data dump includes September wholesale inventories. They were down .1%. Here's the quote: "..were $462.0 billion at the end of September, down 0.1 percent (+/-0.2%)* from the revised August level." What's the starry thingee, one might ask? Well, it's the link to a footnote.

Here's the footnote:
"* The 90 percent confidence interval includes zero. The Census Bureau does not have sufficient statistical evidence to conclude that the actual change is different from zero."

Two points to note about the footnote: 1) the CI is at the 90% level, which is very generous, and 2) it spans 0, which means what the note says. I wonder how many of the reports about the report will bother to tell us about that.
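The arithmetic behind both points is short enough to sketch (Python, for illustration). The -0.1 and 0.2 are the figures from the release. The rescaling to a 95% interval is an assumption: it treats the margin as a normal-based interval, which may not be exactly how the Census Bureau computes it.

```python
from statistics import NormalDist

def interval(estimate, margin):
    """Lower and upper bounds of estimate +/- margin."""
    return (estimate - margin, estimate + margin)

def includes_zero(estimate, margin):
    """True when the interval cannot rule out 'no change at all'."""
    low, high = interval(estimate, margin)
    return low <= 0.0 <= high

# Figures from the release: down 0.1 percent, +/-0.2 at the 90% level.
estimate, margin90 = -0.1, 0.2
spans_zero = includes_zero(estimate, margin90)

# Assumption: a normal-based interval, so margins scale with the z value.
z90 = NormalDist().inv_cdf(0.95)    # two-sided 90%
z95 = NormalDist().inv_cdf(0.975)   # two-sided 95%
margin95 = margin90 * z95 / z90

print(interval(estimate, margin90), spans_zero)
print(round(margin95, 3), includes_zero(estimate, margin95))
```

At the more customary 95% level the margin only gets wider (about 0.24 under the normal assumption), so a change that can't clear zero at 90% certainly can't clear it at 95%.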

Here's the link to the original; click the link for Excel or PDF.

07 November 2011

Write Me a Song So Lovely

It's mostly considered déclassé to simply refer to somebody else's post on some subject. And I don't do so very often, but this post on writeable CTEs is too much to pass up, not least because I've yet to move to 9.1, and won't likely until it hits 9.1.3 (currently at 9.1.1). What makes the post of interest: Brown states that PG is alone with wCTE, and some quality time with searching supports that. My interest rests on the motivation for this endeavor: put, and keep, the logic in the datastore and let the client pretty-pretty the screens as it wishes. Writeable CTEs take this notion yet another step forward.

Not that I'd choose a joined table with a calculated, by code, column. But that's another story. I'd be more inclined to implement an inventory/order line update, since that fits with the canonical example of join. For example, if your customers are mixed with regard to price changes, some will get updated prices on unfilled order lines, the rest not. In that circumstance, you'd want to update some order lines when the inventory (or price break) table is updated. That sort of thing. 9.1.3 shouldn't be too far away.

Good on 'ya, as Texans say.

25 October 2011

From Sea to Shining Sea

As follow up, or update, to the Triage piece, I offer up this post from an R-blogger. As it stands, there's no code (the author claims ugliness), but he does applaud ggplot2. The latter I expected, in that Wickham's book has a section (5.7) on using maps, but not much detail.

Of more interest is the data source, shown as CCES on the plots. Turns out that this is CCES. While not real time data, as Sparks demonstrates, R and ggplot2 can show both categorical and discrete variable impact over a map. For the Triage project, one would need internal real-time (or close) data for the effort to be worthwhile, but I'd wager that it is.

11 October 2011

A Model Citizen

While it is gratifying to be published by Simple Talk, so many more eyes that way, it isn't a platform where I can continue to prattle on at will. Each piece they publish, most of the time, is a stand alone effort. Since the piece was already rather long, there was one tangent I elected not to include, since it is a separate issue from the task being discussed.

"That subject: cleavages." Well, I only wish (and if you know from whence that quote came, bravo). No, alas, the topic is what to do with regard to fully understanding "bang for the buck". I elided that in the piece, since the point was to show that a useful stat graphic could be generated from the database. But how to discover the "true" independent variables of electoral primacy, and their magnitude? Could it be that with all the data we might have, both for free on the intertubes and costly which we generate, our best model is only 30% predictive? To reiterate, the exercise isn't to predict who'll win (FiveThirtyEight has been spectacular), but rather which knobs and switches a given organization can manipulate to *change* a losing situation.

If you'll recall, most of the explanatory variables weren't of a continuous nature, that is, real numbers. The fitted lines in the scatterplots used a variation on simple linear regression to fit. The variation dealt with the differing best slopes over ranges. The technique doesn't account for the fact that most of the explanatory variables are either categorical (yes/no) or discrete (strongly disagree to strongly agree).

For this kind of mixed data regression, one typically uses analysis of covariance (aka, ancova). R, as one would expect, provides this. The Crawley book devotes a full chapter to ancova. I'll direct you there. Some say that discrete independent variables can be used directly in simple linear regression. Others would run to ANOVA immediately. Some distinguish categorical variables (gender) from discrete scaled variables (the 5 point agree scale on gun control). It is, suffice to say, not a slam dunk any way you go.
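For the flavor of it, here is a minimal sketch of the simplest ancova model: one yes/no factor, one numeric covariate, and a common slope across groups. It is in Python rather than R, the data are invented, and `ancova_parallel` is a hypothetical helper, not anything from Crawley. The least-squares fit for this model reduces to a pooled within-group slope plus one intercept per group.

```python
from statistics import mean

# Hypothetical data: (group, covariate x, outcome y)
data = [("yes", 1, 3.0), ("yes", 2, 5.0), ("yes", 3, 7.0),
        ("no", 1, 1.0), ("no", 2, 3.0), ("no", 3, 5.0)]

def ancova_parallel(rows):
    """Least-squares fit of y = intercept[group] + slope * x (parallel lines)."""
    groups = {}
    for g, x, y in rows:
        groups.setdefault(g, []).append((x, y))
    sxy = sxx = 0.0
    means = {}
    for g, pts in groups.items():
        xbar = mean(x for x, _ in pts)
        ybar = mean(y for _, y in pts)
        means[g] = (xbar, ybar)
        # Pool the within-group cross products and sums of squares.
        sxy += sum((x - xbar) * (y - ybar) for x, y in pts)
        sxx += sum((x - xbar) ** 2 for x, _ in pts)
    slope = sxy / sxx
    intercepts = {g: ybar - slope * xbar for g, (xbar, ybar) in means.items()}
    return slope, intercepts

slope, intercepts = ancova_parallel(data)
print(slope, intercepts)
```

The gap between the two intercepts is the effect of the factor after adjusting for the covariate. Whether a 5 point agree scale belongs on the x side or the group side is exactly the argument mentioned above.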

Exploratory data analysis, what R is particularly good at, is where the apparatchiks should be spending much of their effort (not worrying about the entrails of Rails!). Assuming that money is the driver of winning is an assumption, frequently wrong in the real world. Since their organization is large, national in scope, and full of dollars to spend; spelunking through all available data is the directive. That assumes, of course, that winning elections, without regard to policy positions, is the goal. Think of selling nappies.

While the goal of the piece was to display something simple to the Suits, determining a more accurate predictive model, which will be implemented with traditional text output, is the real goal. Same is true of selling nappies. The analogy is not so far fetched, as this book demonstrates; there have been similar treatises in the years since.

10 October 2011

By The Numbers

There's that famous quote from The Bard, "The fault, dear Brutus, is not in our stars, but in ourselves, that we are underlings." As my fork in the Yellow Brick Road tracks more towards (what's now called) Data Science, various notions bubble to the surface. One lies in an age old (within my age, anyway) dispute between traditional (often called frequentist) math stats and those who follow the Bayesian path. From my point of view, not necessarily agreed to exist by those on the other side, Bayesian methods are merely a way to inject bias into the results. Bayesians refer to this "data" as prior knowledge, but, of course, the arithmetic can't distinguish between objective prior knowledge and fudging the numbers.
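A minimal sketch of that arithmetic, in Python, using the textbook Beta prior for a binomial rate. The loan counts and both priors are invented for illustration; the only real content is the posterior mean formula, (a + successes) / (a + b + trials).

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Mean of the Beta posterior for a binomial rate under a Beta(a, b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Hypothetical data: 3 defaults observed in 20 loans.
defaults, loans = 3, 20

frequentist = defaults / loans                   # plain observed rate
flat = posterior_mean(1, 1, defaults, loans)     # uniform prior
rosy = posterior_mean(1, 99, defaults, loans)    # prior "knowledge": defaults run about 1%

print(f"observed rate: {frequentist:.3f}")
print(f"flat prior:    {flat:.3f}")
print(f"rosy prior:    {rosy:.3f}")
```

Same data, same formula. The rosy prior drags a 15% observed default rate down to about 3%, and nothing in the calculation asks where the prior came from.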

So, I set out this morning, being Columbus Day (a day honoring Discovery for some, invasion for others), to see whether there're any papers floating about the intertubes discussing the proposition that our Wall Street Quants (those who fudged the numbers) bent Bayesian methods in their work. As I began my spelunking, I had no prior knowledge about the degree to which Bayesian had taken over the quants, or not. Quants could still be frequentists. On the other hand, it is quite clear that Bayesian is far more mainstream than when I was in grad school. Could Bayes have taken significant mindshare? Could the quants (and their overseer suits) have abused the Bayesian method to, at least, exacerbate and, at most, drive The Great Recession? It seemed to me likely, any crook uses any available tool, but I had no proof.

Right off the bat, search gave me this paper which references one (at a pay site) from the Sloan Management Review. The paper puts the blame on risk management that wasn't Bayesian. You should read this; while the post does discuss the SMR paper on its merits (which I couldn't read, of course), it also discusses the flaw in Bayes (bias by the name of judgment) as it applies to risk management.

Continuing. While I was a grad student, the field of academic economics was in the throes of change. The verbal/evidence/ideas approach to scholarship was being replaced by a math-y sort of study. I say math-y because many of the young Ph.D.s were those who flunked out of doctoral programs in math-y subjects. Forward thinking departments recruited them to take Samuelson many steps further. These guys (almost all, then) knew little if anything about economic principles, but department heads didn't care. These guys could sling derivatives (initially the math kind, but eventually the Wall Street kind) on the whiteboard like Einstein. I noted the problem then, the 1970's. This paper touches on this issue (linked from here). "These lapsed physicists and mathematical virtuosos were the ones who both invented these oblique securities and created software models that supposedly measured the risk a firm would incur by holding them in its portfolio." Nice to know it only took 40 years for the mainstream pundits to catch up.

And, while not specifically about Bayesian culpability, this paper makes my thesis, which I realized about 2003 and have written about earlier: "Among the most damning examples of the blind spot this created, Winter says, was the failure by many economists and business people to acknowledge the common-sense fact that home prices could not continue rising faster than household incomes." One of those, D'oh! moments. McElhone, the Texas math stat, introduced me to the term 'blit', which is 5 pounds of shit in a 4 pound sack. By 2003, and certainly following, the US housing market had become rather blit-y. The article is well worth the reading. There are links to many other papers, and it does raise the question of the models used by the rating agencies. Were these models Bayesian? Were the rating agencies injecting optimism?

Which leads to this paper, which I'll end with, as it holds (so far as I am concerned) the smoking gun (which I found to be blindingly obvious back in 2003): "Even in the existing data fields that the agency has used since 2002 as 'primary' inputs into their models they do not include important loan information such as a borrower's debt-to-income (DTI)..."

This few minutes' trek through the intertubes hasn't found a direct link between Bayes and the Great Recession. I know it's out there. I need only posit such as an initial condition to my MCMC (look it up).

07 October 2011

Book 'Em, Danno

For those of us of a certain age, the notion of physical books is important. I recommend any and all of Nick Carr's books, which deal, in significant manner, with ... books.

After finally figuring out where the house is, UPS dropped off my copy of "Visualizing Data" by Cleveland a day late (the widely regarded as incompetent Post Office and FedEx and the Pizza Guys all manage to find it). It's published by Bell Labs/AT&T (back when it still sort of was, 1993) and Hobart Press which is kind of down the street from Bell Labs. Their only listed books are Cleveland's.

What makes me giddy is what's printed as the end of the Colophon (few books even have such any longer). This is it:
Edwards Brothers, Inc. of Ann Arbor, Michigan, U.S.A., printed the book. The paper is 70 pound Sterling Satin, the pages are Smythe sewn, and the book is covered with Arrestox linen.

This is a real book. See you in a bit. Time to do some reading.

04 October 2011

King Kong, Enter Stage Right

Well, the Gorilla just sat on the couch. Oracle OpenWorld has this announcement.

Buried kind of deep is this:
Oracle R Enterprise: Oracle R Enterprise integrates the open-source statistical environment R with Oracle Database 11g. Analysts and statisticians can run existing R applications and use the R client directly against data stored in Oracle Database 11g, vastly increasing scalability, performance and security. The combination of Oracle Database 11g and R delivers an enterprise-ready deeply-integrated environment for advanced analytics.

OK, so now King Kong has adopted R. Do you see a trend?

03 October 2011

Don't Pay the Piper

Big news day, today. And yet more of interest. We don't need no education.

This is specific to Britain, of course. Note that tuition is £9,000 (at today's rate, that's about $15,000), which is a piddling amount here in the USofA. Community college might be cheaper, in state and all that. More reactionary, back to the dark ages, assertions. Education isn't just vocational. That's why the business has VocEd and real college. And sure, if you want to be an Excel whiz, then learning all that mathy and logicy stuff is boring and a waste of time. I mean, how much do ya need to know to slap together a PHP web site?

Deja Vu, Yet Again

My (single?) long time reader may recall that I concluded that the Oracle buy of Sun wasn't about java or Solaris or any software. It was about stealing the one segment of computing Larry didn't own: the mainframe, the IBM mainframe. I was initially alone, so far as I could see, although in the months following, I would read an occasional story tending toward the hardware motivation. If memory serves, some Mainstream Pundits explicitly stated that hardware was dead in the new Oracle.

Time to feast on some baked bird, crow specifically. Here's the latest from Oracle.

"'We want to take IBM on in their strongest suit, which is the microprocessor,' said Ellison."

Oracle may, or may not, be able to pull it off. Given that IBM's DB2, off mainframe at least, is adopting (well, it depends on who's defining that word) MVCC semantics, one could conclude that Oracle has gotten the mindshare part of the problem solved.

Political Science

A while back, simple talk offered me an article, suggesting that something controversial would be appropriate. I pondered for a bit, and decided not to throw Molotov cocktails as I usually do here. Instead, based on some abortive conversations with apparatchiks in Washington, I set out to demonstrate how one can generate dashboard style graphs using stat output from R all within the database. In this case, the database is Postgres. Here's the piece. Enjoy.

01 October 2011

Are We There Yet?

An update on the world of (semi) serious SSD is in order. The Intel 710 is the successor, sort of, to the X25-E. AnandTech has a review and status update. Worth the read for the industry background alone.

The clearest description, and the one that was most logical: "Fundamentally, Intel's MLC-HET is just binned MLC NAND."

I'll mention in passing that AnandTech is dipping a toe into "Enterprise SSD" review with a piece on OCZ. Not that OCZ is really serious, of course; the Sandforce controllers depend on clear text data streams, which are getting yet more scarce in the Enterprise.

29 September 2011

Pretty as a Picture

Along with an interest in stats and graphs comes a level of responsibility. Kind of, guns don't kill people, people kill people. The canonical text is "How to Lie With Statistics", which was first published in 1954. Legend has it, it's never been out of print. Likely so.

It so happens that I've found a couple of blogs/sites which both deal with graphing stat data in non-disinterested ways. I'll note once again that a stat/quant/analyst/foobar is supposed to be disinterested. S/he's just an impartial judge of the data, trying to scope out the real relationships in the data; if there are any, there may not be. Data associated with politics is particularly susceptible to bias. But others face the same pressure. Worker stat bees (having been there) are often encouraged to slant the presentation in a way to make the nappie marketing Suits look like geniuses. It's a problem everywhere; all worker bees are expected to behave as attorneys; staunch defenders of whatever the Suits have done.

Watching the response to drug clinical trials is particularly amusing. Rather often, the sponsor will be shocked (shocked, I say) that its new FooBar Resolver didn't blast the .05 requirement out of the water. There'll be "unexpected placebo levels" or "unbalanced randomization" or "the FooBar Resolver patients were sicker than placebo". And so on.

Be that as it may, here are a couple of sites worth grazing:
The R Graph Gallery, from Romain François
The Gallery of Data Visualization, from Michael Friendly

27 September 2011

Figures Don't Lie, But Liars Figure

I just found this link, which says it all (well, most all) about lying, stats, and graphs. It's only a bit beyond 5 minutes. Time well spent.

25 September 2011

It Ain't The Meat

Back in the 70's a married lady (but not to me) of my acquaintance had a preternatural affinity for the Maria Muldaur song, "It Ain't The Meat It's The Motion". Nothing to do with me, I'll warrant. It was the 70's, of course, and it means what you think it does. Still true today, but the context relevant to this endeavor is a bit different. Welllll, may be a whole lot different.

One of the neat aspects of R is the ability to talk to most any other application, and vice-versa. R is, justly, known for the support for graphical display of statistical data. googleVis is an R package which links R to the Google Visualization API, empowering "moving" data in an html page. I've not played with it yet, but here's a sample from a blogger who has. Yet another case where the R community builds spectacularly useful widgets for the rest of us to exploit. Who said open source is anti-American communism? For the record, at least little Darl.

A few years back, I was involved with Business Objects, building dashboards. But using BO requires building a shadow schema of the RDBMS it talks to, and runs as its own application; generally a pain in the butt. With R, and PL/R with Postgres, one can drive the data and statistical analysis applications through the database. With googleVis, one can create animated graphs into the browser. Very cool. And his talk was on my birthday. Damn, I missed it.

The advantage of moving graphics is that this is a way to display higher dimension data; using bubble charts and animation, we get four dimensions: the usual x and y axes, plus the bubble size and the motion axis (classically, time).

There are other plotting packages, beyond the base plot() functions, but I'd be willing to say that googleVis is the least difficult of the bunch. It does mean that it's for browser applications.

19 September 2011

Newest Meme: NoClient Database

What's that line from "Network"? "I'm as mad as hell, and I'm not going to take this anymore." Such is my view of NoSql nonsense. I'm not quite as mad at client coders who want to rule over the database, but close.

So, it was heartening to read Dunstan's latest post, in which he describes the end result of banning the client from the system. Save for rendering pretty, pretty screens I gather.

Good on him.

27 August 2011

Don't Mess With Texas

When I was somewhat younger, I worked for a math stat who was born in Rhode Island, grew up in Las Vegas, and did his graduate work in Austin. This was when I first heard the phrase "Don't mess with Texas". Context is everything, and today the context is data storage. Here's one version of the news. Which wouldn't be all that interesting, given that Texas Memory has been doing SSD for decades.

No, what makes this of interest is the following quote:
"TMS is targeting relational databases with its new storage device, just as Fibre Channel drives would be used as the primary storage."

Rad. BCNF support in the flesh. YeeHa.

24 August 2011

Epiphany

Whilst bloviating on the OCZ message board, I had an epiphany. It follows, including the snippet upon which I was commenting.


-- Maybe the strategy is to sell more consumer products at a low GM so they can increase the brand awareness of OCZ which will help sell more enterprise products that have huge GM's.

Not likely. The Enterprise SSD vendors, modulo Fusion (may be), build parts which are lots more expensive, and generally have bespoke controllers and SLC NAND (eMLC, whatever that might really mean, too).

To the extent that Enterprise SSD goes the route of Enterprise HDD (buy 'em cheap and swap 'em when they crap out), OCZ could best the likes of STEC, Violin and Texas Memory. We're not there yet; whoever figures out how to make a cheap SSD which dies gracefully wins. That may not be possible, given the physics, of course.

19 August 2011

Viagra At Home

A bit of R. I've mentioned a few times that I "knew" we were headed into the ditch around 2003. I don't recall that I'd read Shiller at that point (or even that I was aware of him), it was just obvious that house prices were outstripping median income. The raw data is available (at Shiller's blog, http://www.econ.yale.edu/~shiller/data.htm no idea how long it has been), so here's a picture worth a few words. The data run from 1890 to 2009.




Where's my little blue pill???

18 August 2011

Old Frankenstein

In "Young Frankenstein", The Doctor asks Eye-gor (Igor) whose brain he *really* retrieved. Igor replies, "Abby Normal?" I've spent the last hour or so wandering amongst some web sites, blogs, and whitepapers which seek to explain Normal Forms to normal folks; no math, just words.

This one says this: "'Normalization' just means making something more normal, which usually means bringing it closer to conformity with a given standard." Alas, not even close.

Since I've been re-reading my probability, stats, and stat pack books and docs, this flipped a switch. Which switch leads to a clearer, albeit slightly mathematical, definition.

I've done a quick search, and can't confirm that he explicitly said so, but given that Dr. Codd was trained as a mathematician, I'll surmise that he used the word in the following sense. In math, two terms are used as synonyms, orthogonal and normal. Remember from geometry class that a 90 degree line is the normal line? It's also orthogonal. Orthogonal as a concept means independence of influence (just as the X axis is independent of the Y axis; there's some math), and Codd uses that term liberally in his paper.

So, the normal forms have nothing to do with not insane or seeking standards, but with data independence. Which is normal.

16 August 2011

How To Mistreat Life

It is amazing, but so far as I can remember, web apps have gotten more than half-way through 2011 before an article appeared which takes client side code to task for being silly. Hard truth #1 is the worst. And the only way to avoid it: database enforced integrity. There, I said it again. NO DATA GETS WRITTEN WITHOUT THE ENGINE'S SAY SO.
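For the record, here is a minimal sketch of what "the engine's say so" looks like in practice. It uses Python's built-in sqlite3 only because it runs anywhere; the customer/invoice tables and the try_write helper are made up for illustration, and any real engine (Postgres, DB2, Oracle) offers the same declarative constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# Hypothetical schema: the rules live in the DDL, not in client code.
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      NUMERIC NOT NULL CHECK (amount > 0)
    );
""")

def try_write(sql, params):
    """Attempt one write; return True only if the engine accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_write("INSERT INTO customer VALUES (?, ?)", (1, "a@example.com")),  # valid
    try_write("INSERT INTO invoice VALUES (?, ?, ?)", (1, 1, 100)),         # valid
    try_write("INSERT INTO invoice VALUES (?, ?, ?)", (2, 99, 50)),         # no such customer
    try_write("INSERT INTO invoice VALUES (?, ?, ?)", (3, 1, -5)),          # violates CHECK
    try_write("INSERT INTO customer VALUES (?, ?)", (2, "a@example.com")),  # duplicate email
]
print(results)
```

No matter which client, framework, or script sends the last three writes, the engine refuses them. That's the whole point.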

08 August 2011

The Know Nothing Party

I came upon this rant/essay via R-bloggers. Beyond the fact that Zed (love that name) has a background fairly close to mine, is an R aficionado, and is willing to call the Emperor naked, one could substitute "RDBMS" for "statistics" in his piece. It would then read like a few of those which have appeared in this endeavor.

I really should send the link along to some of those folks in Washington I've chatted with over the last few weeks. Nah. They wouldn't get the joke.

Of particular relevance:
"It's pretty simple: If you want to measure something, then don't measure other shit. Wow, what a revelation."

04 August 2011

And The Survey Says...

As my dive into stats, and possible departure from RDBMS as the site at the end of the Yellow Brick Road, continues, I came across a ruby library called fechell. My initial thought: "shouldn't that be fechall, as in Fetch All? Fetch Ell? What does that mean?" Well, D'oh! The normal name for the code is FECHell. Ah, much more to the point.

I found two posts, by way of R-bloggers, by the person who developed the library. Here's the post where he develops the use of the data and the library. He references a Part 1 post with the background.

This intrigues me not a little bit. Suppose, just for grins, that you're the campaign manager for a state wide (or larger) candidate. That is, one where monies are allocated to distinct locations. Further, suppose that you have this data in close to real-time, and you also have data measuring "outcome" for the use of these monies, say polling data. And let's say that the two maps, monies and outcomes, are congruent.

Could one make predictive decisions about monies allocations? Well, it depends. The naive answer is: abso-freakin-lutely!!!! The real answer: not so much. The naive notion is that money well spent is indicated by winning the election (which is kind of too late for allocation decisions) or some upward movement in polling data. Ah. Let's spend where the spending works. Superficially, makes a lot of sense.

The only problem: stat studies invariably show little correlation between money and winning. I know, Liberals in particular are worried about the Citizens United effect, where corporations have gobs more loot than anybody else. They'll just buy the elections. And they well might. This would not make me smile. But, the studies of the data show that the effectiveness of campaign ads is grounded less in their expense than in their content. Sometimes, may be often, attack ads work.

Here's an academic attempt to find out.

And yet another.

A quote from the second story (not, that I know yet, cited from the study):
"While we see an influence of the campaign ad in the short-run, in the long run the ad loses its effectiveness. This finding begs the question: how cost effective is it for politicians to spend millions of dollars on campaign ads which have little long-term effect on voter opinion?"

StatMan to the rescue!!! The problem is that it's now August, 2011, and any application being written as I write (assuming that folks have started) needs to be up and running by January. In order to be worth the time and money expended, the application has to have *predictive* value. FECHell data passed through some software is only retrospective. Political ops should know enough about their candidates and opponents to design ads that work. Making a simplistic leap from $$$ to polling/winning is a waste of that time and money. The retrospective data needs to be run through some multi-variate hoops (either multiple regression or ANOVA, most likely; PCA and MDS are less applicable here) to identify the attributes, besides money, which move the bar toward higher polling or winning.
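To make the "multi-variate hoops" a little less abstract, here is a bare-bones sketch of a multiple regression with two predictors: dollars spent and an attack-ad flag. Everything in it is an assumption for illustration. The observations are synthetic, built from coefficients I picked myself so the fit can be checked; the variable names are hypothetical; and the hand-rolled normal-equations solver stands in for what one would really do, which is hand the data to R's lm().

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# SYNTHETIC observations, not real FEC or polling data:
# (spend in $100k, 1 if the buy was an attack ad else 0)
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0)]

# Poll movement constructed from known coefficients (intercept, spend, attack),
# chosen so that content matters far more than dollars.
TRUE_B = (1.0, 0.1, 3.0)
y = [TRUE_B[0] + TRUE_B[1] * s + TRUE_B[2] * a for s, a in rows]

intercept, b_spend, b_attack = ols(rows, y)
print(round(intercept, 3), round(b_spend, 3), round(b_attack, 3))
```

The fit hands back the coefficients the data were built from, which is all this toy can prove. With real data the interesting question is the relative size (and standard error) of the money coefficient against the content attributes.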

The problem with the simplistic model is that the knee jerk reaction to positive feedback in some campaign is to toss yet more money to that campaign. But that's likely a waste of money. The goal is to use the data to identify those trailing candidates today who'll win tomorrow if they get more $$$ and *spend it on what works*. Pouring money into a winner is a loser. Pouring money down a rat hole is, too. The latter case is more obvious, but the former is just as wasteful.

Economists refer to "opportunity costs"; I can spend $1 on toothpaste or candy. I can't have both. In the short run, candy is dandy. In the long run, toothpaste wins. Campaigns don't, generally, last as long as the toothpaste's long run, but you get the point. Money is finite, and should be spent on those activities/goods/services which gain advantage to the goal. In the case of FECHell data, the goal is winning elections. Looking retrospectively only at $$$ and winners is just the wrong goal.

29 July 2011

STEC Crashes

What's it all mean? Beyond losing a bunch o' cash for those holding (not I)?

It might be a bad thing for BCNF on SSD, but may be not. It kind of depends. According to reports from the conference call, STEC parts are being replaced by its clients (who are mostly storage vendors, not the user enterprises) with cheaper SATA drives and protocol morphing dongles. If true, then STEC's fall, while not good for them, is not relevant to the SSD Revolution.

On the other hand, if this means that SSD is being shifted aside from primary datastore to cache/Tier0/foo, then it bodes ill for my version of the Revolution. In Enterprise, at least. I could live with that. Enterprise has an absolute reactionary tilt; they keep 40 year old COBOL systems alive. Why isn't there a Do Not Resuscitate for dying code?

New systems, from smaller builders, are where the "innovation" will come from. I can live with that. If I never see the inside of a Fortune 500 (as an employee, that is) building, that is perfectly OK.

28 July 2011

Mongo Loves Candy

I recently chatted with some folks about real databases, in RoR, to solve real problems. Not so sure they're interested in real databases, but they're interested in Rails. Along the way, they mentioned that they'd been using Fusion-io SSDs. Be still my heart! Turns out that they've a separate datastore, in MongoDB, which had become as slow as molasses uphill in winter. So they bought a 1T Fusion-io card, in hopes of speeding things up. Didn't work out.

What's not widely understood about PCIe SSDs is that they're, more or less, heavily dependent on the cpu to get the work done. Or, as Zsolt puts it (on today's front page): "how much of the host CPU power is needed to make the SSDs work? - this is important if you're trying to fix an already overloaded production server - because you can't afford to lose performance while you tune the hot spots (even if the theoretical end point of the tuning process is faster)". I suspect they might decide MongoDB is the problem (document datastores make my teeth hurt). SSD with BCNF databases will generate real performance improvements. PCIe cards are not indicated if the problem is cpu bound.

One can find out, well enough anyway, whether the process is cpu or I/O bound with iostat and vmstat on *nix systems. That's the place to start.
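A rough sketch of reading the tea leaves, assuming the column layout of Linux procps vmstat (us, sy, id, wa). The sample text below is hand-typed for illustration, not captured from any real server, and the thresholds are my own guesses; tune to taste. Note that the first line vmstat prints is an average since boot, so look at a later sample.

```python
# Hand-typed, illustrative vmstat-style output (NOT from a real machine).
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812344  90112 402188    0    0     4    12  310  620  6  2 92  0  0
 2  7      0 640120  90200 455004    0    0 48210 15300 2900 5100  5  4 28 63  0
"""

def classify(vmstat_text, wait_threshold=20, busy_threshold=80):
    """Label the last vmstat sample as 'io-bound', 'cpu-bound', or 'neither'.

    The thresholds (percent of CPU time) are assumptions, not standards.
    """
    lines = [ln.split() for ln in vmstat_text.strip().splitlines()]
    # Find the column-name row by content rather than by position.
    header = next(ln for ln in lines if "us" in ln and "wa" in ln)
    sample = dict(zip(header, map(int, lines[-1])))
    if sample["wa"] >= wait_threshold:  # CPU mostly waiting on disk
        return "io-bound"
    if sample["us"] + sample["sy"] >= busy_threshold:  # CPU mostly doing work
        return "cpu-bound"
    return "neither"

print(classify(SAMPLE))
```

If the verdict is cpu-bound, a PCIe card that leans on the host cpu is the wrong medicine.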

27 July 2011

Ohm's Law

I've gotten to enjoy Christophe Pettus' postings linked from the PostgreSQL site. He does a neat presentation. This is his latest. Note especially pages 50 and following. While he's a Python/Postgres kind of person, and I'm currently exploring RoR again (long story), he does say things the way I do. Not quite as famous as he is, of course. In the database is truth. Note in particular his observations with regard to "cloud" I/O; it's what I've always suspected. It's your data, don't treat it like a red haired step-child. The SSD is the future of normalized, i.e. fast, data. The "data explosion" is largely the result of bad (non-existent?) data modeling. Cloud is all about minimalist/commodity parts which are easily re-assignable. If anything kills off the RM, it will be public clouds. Coders get infinite employment, and the profession relives the 1960s. Sniff.

So far as that goes, what he's saying about coders abusing the database from Django is about what I've seen with coders abusing the database from RoR; may be more so, given David's attitude toward data. The problem with ORMs is that they seek to solve a problem created by OO coders, but which doesn't exist in the Real World. Such coders refer to the problem as Impedance Mismatch, which is merely an assumption that objects can't be populated with data from the RM. But it's just an assumption. What they steadfastly (shades of Tea Baggers, what?) refuse to acknowledge is that BCNF databases allow for construction of arbitrarily complex data structures, unlike the hierarchic/IMS/xml approach, which is locked in to a parent/child structure. Change that, and all the application code which manages it has to change. Well, unless you've written a bare bones RM engine into your application. Don't laugh; I've lived through folks doing just that.

The world isn't hierarchic, no matter what OO/xml folks want to assert. I've worked lots of places, small to huge, and the archetype for the hierarchic structure doesn't actually exist. That structure is the Org Chart. In the hypothetical world, each worker bee has one, and only one, supervisor. The real world is run on Matrix Management: one has a supervisor du jour, never the same one each day, varying by project/location/assignment/foobar. The real world is relational; connections come and go, in vivid multiplicity. The relational model stores such natively. From this structure can be built any set of connections which arise. By *not predefining* the connections, only the absolute identities of each type/rule, one can create new relationships simply by naming new foreign keys (cross-reference tables, by various names, for many-to-many relations).

One can also add new data without (if one has been moderately smart with the DDL/SQL) clobbering any existing SQL or (heaven help us all) application code which directly queries the DB. Existing queries can ignore, if desired, new columns and new tables; so long as one avoids 'Select * from ...', of course. You would never do that, right?
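A small sketch of the matrix-management case, using Python's built-in sqlite3 so it runs anywhere. The table and column names (person, project, supervision, mentoring) are invented for the example; the point is only that the many-to-many connections live in cross-reference tables, and a new kind of connection is just one more such table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Identities only: no connections are predefined here.
    CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Cross-reference table: who supervises whom, on which project.
    CREATE TABLE supervision (
        worker_id     INTEGER NOT NULL REFERENCES person(person_id),
        supervisor_id INTEGER NOT NULL REFERENCES person(person_id),
        project_id    INTEGER NOT NULL REFERENCES project(project_id),
        PRIMARY KEY (worker_id, supervisor_id, project_id)
    );

    INSERT INTO person  VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO project VALUES (10, 'Billing'), (20, 'Reporting');
    INSERT INTO supervision VALUES (1, 2, 10), (1, 3, 20), (2, 3, 20);
""")

BOSSES_OF_ANN = """
    SELECT s.name, p.name
    FROM supervision x
    JOIN person  s ON s.person_id  = x.supervisor_id
    JOIN project p ON p.project_id = x.project_id
    WHERE x.worker_id = 1
    ORDER BY p.name
"""
before = conn.execute(BOSSES_OF_ANN).fetchall()  # Ann has two supervisors

# A new kind of relationship arrives later: one more cross-reference table.
conn.executescript("""
    CREATE TABLE mentoring (
        mentor_id INTEGER NOT NULL REFERENCES person(person_id),
        mentee_id INTEGER NOT NULL REFERENCES person(person_id),
        PRIMARY KEY (mentor_id, mentee_id)
    );
    INSERT INTO mentoring VALUES (3, 1);
""")
after = conn.execute(BOSSES_OF_ANN).fetchall()  # the old query is untouched

print(before)
print(before == after)
```

Try doing that with a parent/child document without rewriting every bit of code that walks the tree.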
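Here is a quick illustration of why, again with Python's built-in sqlite3 and a made-up part table. The query that names its columns keeps returning the same shape after the schema grows; the star query silently changes width under whatever code consumes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (part_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO part VALUES (1, 'widget')")

NAMED = "SELECT part_id, name FROM part"  # asks for exactly what it needs
STAR = "SELECT * FROM part"               # asks for whatever happens to exist

before_named = conn.execute(NAMED).fetchall()
before_star = conn.execute(STAR).fetchall()

# The schema grows after the queries were written (hypothetical new column).
conn.execute("ALTER TABLE part ADD COLUMN weight_kg REAL")

after_named = conn.execute(NAMED).fetchall()
after_star = conn.execute(STAR).fetchall()

print(before_named == after_named)               # named columns: unchanged
print(len(before_star[0]), len(after_star[0]))   # star: row width before and after
```

Any client code that unpacks the star rows positionally is now broken, and nobody touched it.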

13 July 2011

M'mmm, Kool Aid

I'll include some of the text, since the way Zsolt's site is structured, entries tend to disappear down a rabbit hole. Today, this is still front page. Go there to finish it up: he's got quite a lot to chew on, and he does the site for a living.


Editor:- July 11, 2011 - I recently had a conversation with a very knowledgeable strategist at a leading enterprise storage software company. I won't say who the company is - but if I did - most of you would know the name.

The interesting thing for me was that he'd recognized that if the hardware architecture of the datacenter is going to change due to the widespread adoption of solid state storage - that will create new markets for traditional software companies too.

And I'm not talking here about new software which simply helps SSDs to work or interoperate with hard drives - but software which does useful things with your data - and which can take advantage of different assumptions about how quickly it can get to that data - and how much intensive manipulation it can do with it.


While he doesn't say BCNF-RDBMS in his text, he's saying it. I've been bugging him for some time to drink the Kool-Aid. Sounds quite like both he and the unnamed "strategist" (no, not I, alas) have quaffed deeply. Face it, if all you do is keep appending flatfile "fields and records" to some file, not only do you never get ahead of the bull, but you get gored sooner or later. BCNF is the *only* hope. Yes, this requires designers/developers to actually *think* about the data. But, isn't that why we get paid the *big bucks*?

(OK, I went a bit asterisk nuts with this one. Finding validation does do that.)

12 July 2011

What's Up Doc??

Another in the occasional series of posts from elsewhere. This time, simple-talk (no surprise there) with a thread on bugs. Herewith my contribution, because the issue of buggy software can't be divorced from the application architecture and data language.


Ultimately there are two categories of bugs:
A) those caused by stupidity, inattention, carelessness, etc.
B) those that are the result of extending the developer's/team's experience

The entire ecosystem around each is necessarily different, and there are multiple approaches.

The A variety will be dealt with as the ethos of the organization dictates; anywhere from fired on first mistake to employed forever out of harm's way. Detecting such bugs should be possible with known testing harnesses/practices.

The B variety is more interesting.

For those in the BCNF realm, much of what passes for "new technology" in data stores and processing is VSAM redux, which brings with it the COBOL RBAR mentality, regardless of the source language. This POV is wrapped in whatever jargon is native: NoSql, Hadoop, Map/Reduce/BigData/foobar. But the fact remains that coders are implementing ACID (if they care at all about their data) in some (high-level) language outside the storage engine.

Whether the organization realizes its mistake, and implements engine side processing, a la Phil's current article, or undertakes to use the FOTM client side framework, the coders are left in unexplored territory.

Left unexplored, generally, is an analysis of what architecture (engine side vs. client side vs. application language vs. database engine [not all do all things well]) is the least prone to both type A and type B errors for the application in hand.

Declarative languages (SQL, Prolog) just tend toward fewer errors. SQL is dependent on schema quality, and coders tend to view schema specification as a low value, unimportant task. Certainly not one for which specific expertise and experience is required; any coder can do it.

The bugs that matter, which mess up the datastore, are just less likely if processing stays in the engine. Bugs which consist of ugly fonts, not so much.

As to IT managers, again, two categories: those that were and are technically superior, and those who never were. The former, albeit rarely do they exist, generally get more done. The latter do awesome Power Points.

09 July 2011

Workin' on the Chain Gang

LinkedIn, LinkedIn whatever are we to do with you? I've not had anything to say, given how silly the whole mess is, but today's NY Times has an almost true article. I don't have meaningful disagreement with the problems raised in the article, but it avoids the underlying issue. (That it makes my thesis that advertising based business is inherently unstable, is another atta boy for me.)

No, the problem with LinkedIn is that the business model is foolish. The business model is based on the assertion that people without employment and income will rush out to buy stuff. How stupid is that? One can slather on all sorts of finery, but that's the business model. At least Google attaches ads to activities utilized by everybody.

There's a reason that employment agencies charge money for their services; they actually do some work. Most of it is negative, removing for essentially arbitrary reasons otherwise qualified folks. LinkedIn presumes that if an unemployed is known to the employed, that this will embolden hiring agents to consider an unemployed for a position. Factually false. Been there, done that. Employers, though it be illegal, are more than willing to admit not interviewing an unemployed.

What, then, about those on LinkedIn who are currently employed? Will they be buying stuff? May be. May be not. The folks from my last employer that LinkedIn offers up each week or so, for instance. Are they looking? I don't know. I do know that 99.44% of them have never worked anywhere else (both young and old) or on any other software. In many cases only the decades old COBOL that constitutes the application. On a mainframe. Will such folks be buying stuff? Probably not.

Near as I can tell, LinkedIn, whether its progenitors say so or not, is attempting to implement what the high end (or low end, depending on your point of view) agencies promote: access to the hidden job market. Whether such actually exists has been a matter of controversy at least since the 1970's, lawsuits and all. For companies large enough to have an HR department, ain't nobody gettin' through without they go through them. It's job preservation, after all. For the SMB crowd, it might work. For startups (where the really interesting, and vastly stupid, activity is), even less so.

LinkedIn is a bottle rocket, soon enough to come crashing down. Google needn't worry that it is the advert server to fear. There will be such an advert server, as I have written. LinkedIn isn't it.

30 June 2011

I'm Just Going Through a Bad Phase

"Phasers on stun", said Captain Kirk. Well he said that just about every episode. Today ComputerWorld reported on IBM's PCM flash replacement. For background on PCM, see WikiPedia, and the article has a link to a much earlier one on PCM.

From the WikiPedia piece:
"PRAM devices also degrade with use, for different reasons than Flash, but degrade much more slowly. A PRAM device may endure around 100 million write cycles."

In the past, I've written about Unity Semiconductor, which has had a flash replacement in development for some years; I found them when I first began looking into SSDs. One way or another, we'll soon have a solid state datastore that is effectively infinite in write capability, just like HDD.

Once again, Oz looks larger and brighter. Be still my heart.

28 June 2011

A Bump in the Yellow Brick Road

No, I'm not renouncing. Events of the last year or so caused me to ruminate on this journey down the Yellow Brick Road. Some of the events:

Consumer/prosumer SSDs persist in not being built with data caps. The industry is, perhaps, more divided now than at any earlier time. Consumer devices use barely tractable MLC flash (~3,000 cycles), and SandForce continues to gain traction in the consumer side. Given the finite nature of flash, an SSD will die in the near future. An HDD, on the other hand, might well continue to function for the better part of a decade. In any event, the HDD doesn't have a defined drop dead time.

Capacity remains under a TByte for the vast majority of parts. This is important because:
most folk continue to view SSD as just a faster HDD; which isn't important outside of the RDBMS arena, but critical to getting the most bang for the buck there. For RDBMS installs, where (re)normalizing is ignored, moving from HDD to SSD is expensive, so is often attempted with consumer level drives. In the HDD world, that's not unusual; most drives are both over there.

Small scale (web and SMB verticals, for instance) databases, often on MySql or Postgres, just won't be safe enough on consumer drives. The various threads on postgresql-performance make the case, much as I'd wish the truth be otherwise. What's particularly odd is that both vendors and most consumers appear to be OK with catastrophic loss of data in normal life. Very odd.

The physics of writing, SSD vs. HDD that is, is just way cool different. SSD controllers spew the bits all over the flash, and the erase process can hardly be considered atomic. The majority of SSD controllers use RAM caching to reduce write amplification, and this is an additional fault point. HDD based engines, industrial strength ones like DB2, can guarantee that only the open transaction(s) will be hosed on a failure. SSD based storage just can't if there isn't persistent power available.

The failure of developers, at least those who publish, to lobby for (re)normalization as part and parcel of transition from HDD to SSD is regrettable.

Is there still a Yellow Brick Road leading to Oz? I still believe so, but Oz looks to be more a Potemkin village than a New World. Only shops with the fortitude to make a full transition using enterprise quality SSDs will actually get there. One can eliminate 99.44% of web sites and SMB verticals; they're just content to be penny wise and pound foolish. Oh well.

14 June 2011

Sprechen sie Deutsch? Habla Espanol?

Artima has been quiet of late, not many new articles or comments. Could be that coders are satiated? Then I went over today, and Bruce Eckel has praise for Scala. Last I read, he and Bruce Tate (not related, so far as I know) had gotten Python fever. Well, Eckel has been infected for some time. Which is why I never expected to get a Scala piece from him.

Smitten, I ordered "Programming in Scala"; finally bit the bullet. It's one of the few languages, not on a database engine, that still intrigues me. Yes, I've done Prolog, Erlang, and Haskell, to name some. And, yes, Prolog is the closest to an RM language out there. But it's just not used enough. That, and the syntax is just plain wacky (not as extreme as Lisp, but that's to be expected). 2.8 is the current version, and is covered in this second edition.

But, back to Eckel's article. What struck me, yet again, is the emphasis on iteration in his discussion. Yet another language creating yet another syntax to loop. Why are we still doing this in high level languages, not-assemblers? For some code, that which doesn't deal with real data in databases, it could be argued that application code needs to loop. But, even then, I don't quite buy it. I spent/wasted some years with Progress/4GL, a database engine + application language. It had flippant support of SQL in the engine, but 99.44% of coders used the bundled 4GL. And how did this language deal with table data? You guessed it, --For Each-- . Now, this was promoted as a *4GL*, not COBOL. Fact was, it was effectively COBOL. We've been saying to each other for at least three decades that the future is now, and the future is declarative coding, yet we keep focusing on application level iteration. The datastore should do that. It's written in the lowest level language, typically naked C (with, I suspect, the performance bottlenecks written in each supported OS's assembler).

Let the datastore be the datastore!! (Yes, that does remind you of some political hackery from the 1980's.)

Declarative development is exemplified by the RM and RDBMS. Why the refusal?

We continue to see, even on database support/discussion sites, web/client app code that is concerned with sending whole result sets to the client. Why? If your intent is to make changes to multiple rows in a specific fashion, that's a stored procedure. Do it all on the engine. That's what it's good at. Don't ship data off the server, just so you can iterate (using the very special syntax of your fave language) over thousands or millions of rows. This is folly. Such discussions always then devolve into arguments about transactions locking rows; and so forth. Yikes!
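A minimal sketch of the difference, using Python's bundled sqlite3 module. The orders table, its columns, and the 10% markdown are made up for illustration. SQLite has no stored procedures, so a single UPDATE statement stands in here for the work a procedure would do on the engine.

```python
import sqlite3

def make_db():
    """Build a small in-memory table; names and values are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO orders (status, amount_cents) VALUES (?, ?)",
        [("open", 10000), ("open", 25000), ("closed", 7500)],
    )
    return conn

def discount_by_loop(conn):
    """Client-side iteration: pull every row, decide in app code, write back one by one."""
    rows = conn.execute("SELECT id, status, amount_cents FROM orders").fetchall()
    changed = 0
    for row_id, status, cents in rows:
        if status == "open":
            conn.execute(
                "UPDATE orders SET amount_cents = ? WHERE id = ?",
                (cents * 9 // 10, row_id),
            )
            changed += 1
    return changed

def discount_by_set(conn):
    """Set-based: one declarative statement; the engine does the iterating."""
    cur = conn.execute(
        "UPDATE orders SET amount_cents = amount_cents * 9 / 10 WHERE status = 'open'"
    )
    return cur.rowcount

def amounts(conn):
    return [c for (c,) in conn.execute("SELECT amount_cents FROM orders ORDER BY id")]

loop_db, set_db = make_db(), make_db()
loop_changed = discount_by_loop(loop_db)
set_changed = discount_by_set(set_db)
print(amounts(loop_db), amounts(set_db))
```

Both routes end with the same rows. The loop version ships the whole table to the client and issues one statement per row to get there; the set version sends one statement and no data leaves the engine.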

While it isn't quite what one might expect, given its title, everybody should read Celko's "Thinking in Sets". It's not a treatise on set theory in the datastore, but still useful in providing examples where table data makes more sense (even in a performance metric) over code.

09 June 2011

My Security Blanket

Regular readers may note that Linus Torvalds has pride of place in the quotation section of this endeavor. He's recently been interviewed by some e-zine I've not heard of.

Here's a quote:
I'm also a huge fan of SSDs, and the huge reduction in latency of storage technologies has some big impacts on OS performance. A lot of people end up spending a lot of time waiting for that traditional rotational media.

Pretty much what he said back in 2007. Let's get a move on.

07 June 2011

Fiddlers Three

In the event that you haven't been following the news, Fusion-io is in the midst of an IPO. It turns out that Violin Memory is closing in on doing so, too. Since I began digging into the SSD story some years ago, Violin has been kind of (but not quite) stealthy. In part, because it is still private, so doesn't end up in various "investor" discussion forums, and in part because it focuses on the True Enterprise SSD; Fusion-io and OCZ in particular are public and consumer focused.

The article mentions Oracle, which fits with my vision of where SSD databases are going. Eventually even Larry will figure out that the bang for the buck (his, not necessarily his customers') lies in Being Normal. For Larry, that's saying a bit.

31 May 2011

You're Solid, Man

If you're of a certain age, "solid, man", has a certain meaning (if you go here, and open up the list and then go to the bottom). SolidFire is a company which claims to be building SSD in the cloud; somehow or other I got on the early announcement list.

According to their press section, they've been in business since January of this year. It'll be interesting to see whether they make the connection between the value of SSD and the value of the RM/SQL/RDBMS. Minimize footprint, wear SSD galoshes.

24 May 2011

Do the Limbo Rock

There's been some hoopla recently about OCZ's latest drives, and the SandForce controller used. Today, AnandTech looks at the latest of the latest, the Agility 3. Of most interest to this endeavor is the background on NAND architecture.

What's it all mean, Mr. Natural? Well, with respect to Enterprise (or even SMB) systems, not much I expect. OCZ and SandForce remain, so far as I can determine, firmly in the consumer/prosumer territory. Until SF drives start showing up from a major vendor, of course (there's not been an announcement from SF, yet). I don't see that happening, what with the implementation of encrypted/compressed data in industrial strength databases. Workgroup/internal document storage, may be.

05 May 2011

Mr. Wonka's Factory

Here's a post on simple-talk which was dormant for most of a month (it's down to the last place on the front page today), then got really popular the last few days.

The subject: cleavages; well no, but I've liked that line since Lou Gottlieb used it to introduce a song lo those many years ago. Well, may be, actually, in a manner of speaking.

The subject is database refactoring. The post and discussion started out blandly, but then opened up (cleaved, see?) that long festering wound of front-end vs. back-end design/development. Simple-talk subjects and discussions tend toward a coder perspective, I suppose because T/SQL is often germane to topics, so a certain amount of self-selection exists both for authors and readers.

The idea of refactoring a database, especially given the tripe espoused by the likes of Ambler (some of his stuff), has been hijacked by the kiddie koders. Refactoring, for an RDBMS, means getting more normal. It doesn't mean twisting the schema to fit some single access path, a la IMS or xml (hierarchies). That's not why Dr. Codd made the effort. He made the effort because he saw the mess that IMS was making; he'd been there when IMS was released and devised the RM a couple years later. In other words, it didn't take a math guy very long to identify both the problem and the solution.

Now that we have high-core machines with SSD as primary datastore, we can implement BCNF (at least) schemas/catalogs with no worries about "let's denormalize for speed". Those days are truly history.

03 May 2011

Get Offa That Cloud

On more than one occasion, I've criticized all things cloud, since I see cloud as a lowest common denominator storage approach.  But then, this is based only on my experience with cloud-like provisioners over the last couple of decades.  Well, turns out I'm not the only one, and some of these folks have as much fun with the silliness as I have.  The thread is new, so keep track for further hilarity.

01 May 2011

Ruler of All I Survey

There's that old phrase, "master of my own domain", and it is particularly useful in relational databases. While working in the COBOL oriented DB2 world of Fortune X00, I was continually rebuffed whenever I suggested the use of check constraints. I had to sneak them into some of my lesser databases, but that was cool.

In addition to regular check constraints, there is the notion of domains. They're sometimes referred to as user defined types. Here's a Postgres based treatment. Interestingly, Postgres has better support than DB2; PG allows check constraints while DB2 doesn't. Once again, the authors make my point (admittedly, the point made by anyone who takes RDBMS seriously) that control of the data belongs in the database, rather than relying on each application's code. Bulk loads, with most engines, enforce constraints so batch data transfers are a breeze. Not to mention that client code can be in any convenient language. And that a smart generator can use domain/UDT definitions whilst doing its thing. I've said that before, I believe.
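A small sketch of the engine doing the policing, again with Python's bundled sqlite3 module. The employee table and its rules are hypothetical. SQLite has no CREATE DOMAIN, so the checks sit on the columns; in Postgres the same rule could be declared once as a domain with a CHECK and reused across tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        id     INTEGER PRIMARY KEY,
        name   TEXT    NOT NULL,
        salary INTEGER NOT NULL CHECK (salary > 0),
        grade  TEXT    NOT NULL CHECK (grade IN ('A', 'B', 'C'))
    )
    """
)

def try_insert(name, salary, grade):
    """Return True if the engine accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO employee (name, salary, grade) VALUES (?, ?, ?)",
                (name, salary, grade),
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("Alice", 50000, "A")
rejected_salary = try_insert("Bob", -1, "B")       # violates salary > 0
rejected_grade = try_insert("Carol", 40000, "Z")   # violates the grade list
row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(accepted, rejected_salary, rejected_grade, row_count)
```

The client here happens to be Python, but any client in any language, or a bulk load, would hit the same wall, which is the point.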

25 April 2011

He Ain't Superman

Back in the early days of java and the time of the dotcom bust, George Reese was a minor pundit/author (O'Reilly division) in the database part of the world.  You can look up his stuff at O'Reilly or Amazon.  Not much heard from since then.  He would occasionally show up on the O'Reilly site.  And he has again.

Here's his take on the Amazon fiasco.  What's so intriguing about this missive is the Up is Down meme involved here.  The Cloud meme has been promoted as a less: expensive, time, resource, attention answer to the Data Processing Problem, particularly for web sites.  Reese spends his few thousand words telling us that the SLA is *our* responsibility, not Amazon's.  Ditto for infrastructure design.  Ditto for physical design.  Ditto for just about everything.  Well, you're not allowed RDBMS, but you didn't want that anyway, did you?

He spends all that ink telling us that we have to work around the fragility of Cloud provision, in order to utilize Cloud.  And we can't do databases, because, well, Cloud just isn't quite up to that.  Where, exactly, is the win for clients who care about their data?  He admits that such services won't provision such that there is sufficient excess capacity to support the loss of significant resources.  Kind of like what might happen if you ran your own datacenter, only worse. 

What's most annoying about his missive is that he makes, nearly explicitly, the assertion that the storage method, Cloud, determines the nature of the datastore.  Not only can you not have SSDs as primary store for BCNF databases, you can't even have *any* sort of RDBMS if you use Cloud.  Last time I checked, that's the tail wagging the dog.  But knuckleheads who admire the Emperor's New Clothes typically ignore such conflicts.

Oh, and you really should go read the piece.  The commenters have a field day with his silliness.  Yum.

[UPDATE]
In my haste to get this in words, I neglected to explicitly state the objection to Cloud I (and others) hold:  the value proposition for Cloud is that it enables organizations (even just the IT group on its own) to outsource an unwanted responsibility at lower cost; don't do it yourself and save money.  Reese's treatise, and the failure, deny the existence of that proposition.  This, while I gather he doesn't get it, is kind of a Big Deal.

24 April 2011

Was A Cloudy Day

As regular readers know, I've not been a fanboy of anything Cloud.  My reasons are less to do with security, reliability, and other mundane considerations; rather that Cloud represents lowest common denominator (race to the bottom) disk storage.  As a means for storing family vacation photos, well, OK.  I'd prefer to keep those on my own storage, but each to his own.  But Cloud for serious storage, I've never been a fan.  My Yellow Brick Road is paved with SSD running BCNF databases.  Unless, and until, Cloud provisioners recognize that it ain't "about just bytes", I'll pass.  Sometimes, if it's too good to be true, it ain't true.

Then Amazon jumped the shark.

Here's another take on the situation.  This is from an experienced Cloud provider, of a sort.  As he says, the Fortune X0 have been trying to provide a central storage solution for rather a long time, with little obvious superiority.  It's worth noting that the IBM Service Bureau service goes back to, at least, the early 1960's.  Cloud is neither new nor a walk in the park.  I guess Amazon and its clients now know that.

The Service Bureau (and similar) were able to provide some semblance of service over leased lines.  The notion that TCP/IP, with HTTP tossed in, over normal phone lines is sufficient is, well, immature.  What was that you said?  The Emperor has one fine set of threads.  Yes, yes he does.

20 April 2011

Not A Cloud Was in the Sky

On more than one occasion, I've made the point that "cloud computing" is really very old hat.  Or, old wine in new bottles.  I came across this interview with IBM honcho Steve Mills.

Here is a clip:

Mills: I think that's the number one reason why this is appealing to CIOs. The interesting part about "cloud speak" is that many people want to isolate their discussion of cloud to only certain classes of companies. And my view is that's too narrow. Service bureaus emerged in the 1960s. I mean, ADP is one of the industry's biggest and most successful cloud companies.

Knorr: And earliest, right?

Mills: Yeah. If I'm in the accounting department and I use ADP, ADP is my cloud company.


The "service bureau" is actually older than ADP, and was an IBM invention.  See this search result.

06 April 2011

Your Diagnosis, Herr Freud?

Just a short note about the new Intel 510/320.  In a nutshell, Intel claims the Marvell powered, sequential biased, 510 is the Prosumer part while the Intel powered, random biased, cap protected 320 is the Consumer part.

This makes no sense at all.  The early reviews, both systematic and informal, say that the 320 is a righteous part.  I'm going to order the 160 gig in a few weeks (or so), depending on whether there're reports of funny business.  Amazon shows two part suffixes: B5 and K5; the AnandTech review is a K5 part, so that'll be the one.  Now I can, finally, get around to running some data (leave the other stuff to the side for a week...) through HDD, G2, and G3 (as the 320 is being called).  Fingers dirty.  Yum.

29 March 2011

What's Your Preferred Position?

Regular readers know that I've been talking up the synergy between iPad type tablets and the normalized relational database.  And that such synergy must recognize the nature of input on a tablet.  Little bits of pickable data.  Hors d'oeuvres, so to speak; not a four course meal.

A few minutes ago Anand posted a quiz.  I guess the question has reached the mainstream pundit class, although it's not quite the right question.  The right question is:  what kinds of data can a keyboard-less device support, and therefore what kinds of applications are best suited to such devices? 

28 March 2011

Mr. Natural Answers Your Questions

A posting on the PostgreSQL/Performance group (having to do with the Intel 320 announcement, which I'll save for a different posting after the dust settles a bit) got me to looking again for published tests of SSD vs. HDD and databases.  As you can see, Dennis Forbes is listed in the links block.  I don't recall whether I've mentioned this post of his before; may haps I have.

But, in some sense related to the Intel 510/320 situation, he makes the salient point (my point since I discovered SSDs years ago) thusly:

Of course NoSQL yields the same massive seek gain of SSDs, but that's where you encounter the competing optimizations: By massively exploding data to optimize seek patterns, SSD solutions become that much more expensive. Digg mentioned that they turned their friend data, which I would estimate to be about 30GB of data (or a single X25-E 64GB with room to spare per "shard") with the denormalizing they did, into 1.5TB, which in the same case blows up to 24 X25-Es per shard.


Of course, his insight is rare, even among those I've read who've been positive about RDBMS/SSD synergy.  It's always been obvious:  normalized datastores are orders of magnitude smaller than their flatfile antecedents.  This smaller footprint comes with all the integrity benefits that Dr. Codd (and Chris Date, et al since) defined.  There are a whole lot of ostriches out in the wild, insisting that massive datastores are needed by their code.  What does it all mean Mr. Natural?  Don't mean shit.
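The arithmetic in Forbes's quote is easy to check. This sketch uses only the figures he gives (30GB, 1.5TB, 64GB X25-E); reading 1.5TB as 1,536GB is my assumption, though the decimal reading of 1,500GB rounds up to the same drive count.

```python
import math

X25E_GB = 64  # capacity of the X25-E named in the quote

def drives_needed(data_gb, drive_gb=X25E_GB):
    """Whole drives required to hold data_gb, ignoring RAID and spare space."""
    return math.ceil(data_gb / drive_gb)

normalized_gb = 30            # Forbes's estimate of the friend data
denormalized_gb = 1.5 * 1024  # 1.5TB taken as binary units (assumption)

normalized_drives = drives_needed(normalized_gb)
denormalized_drives = drives_needed(denormalized_gb)
blowup = denormalized_gb / normalized_gb

print(f"normalized:   {normalized_drives} drive(s) per shard")
print(f"denormalized: {denormalized_drives} drive(s) per shard")
print(f"size blowup:  {blowup:.1f}x")
```

One drive with room to spare versus 24, per shard, for the same facts.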

25 March 2011

Shape Shifting

One thing that I really like about O'Reilly books is the Rep-Kover binding; the original better than the current, however.  I find that most computer texts are near interchangeable with respect to content.  It's nearly always marginal, so what matters is ease of use.  For that, Rep-Kover is better than current "hardcover" bindings.  What I tend to dislike about O'Reilly is their (his?) incessant need to create "new" memes in the computing world.  Web 2.0 is, I think, the first; certainly the most infamous so far.

The last few months have seen the aborning of another: Data Science.  This one is even worse, in that it seeks to dumb down a perfectly legitimate pair of professions; statistician and operations researcher.  Long ago, I got involved in ISO-9000 certification, which was another early attempt to dumb down those professions (these days it's Six Sigma, which I had the pleasure to mentor at CSC).  It irritated me then, too.  It's of a piece with DIY neurosurgery, although not as directly deadly.

Yesterday's Forbes on-line version published a story about this newfangled profession, in the context of EMC.  Regular readers may remember that STEC, gorilla of the Enterprise SSD jungle, first touted, then crashed, on its relationship to EMC.  The article whispers that STEC, or whoever is currently supplying, is doing well and will continue to.

What's most bothersome about this meme is, as many others have remarked, both math stats and ORs do inferential stats, and inferential stats is based on the math of sampling and inference.  The fact is, one needn't have much training to calculate the parameters of populations.  Fact is, math stats and ORs don't even refer to these numbers as statistics, because they aren't.  It is exactly the same as baseball stats; they aren't stats, just numbers.  But, of course, the meme-sters once again wish to wrap themselves in the blanky of higher math. 

On the other hand, stats as a profession and work product is more interesting than computers.  Even databases, by golly.  May be I'll try to parlay both; the article says that such folks (humble self qualifies) are in demand. 

11 March 2011

You Rook Mahvelous

Just when you've figured it out, the world has a habit of slapping you upside the head.  By now, you've likely heard that Intel has announced its next SSD, the 510.  It's not, explicitly, the X25/G3.  From the various sources I read, a G3 will be coming along in due time, but is intended to be the "consumer" version, while the newly announced 510 is the "pro" part.

Here's what's puzzling:  as you can see from this AnandTech article, the 510 is biased toward *sequential* processing!  Boy howdy, I never saw that coming.  That, and the fact that the controller isn't home grown, but bought in from Marvell.  The G3 is said to be driven by Intel's controller, but not yet confirmed.

The world has been turned upside down.  Either that, or Intel has completely misread both the technical and buyer worlds.  A sequentially biased SSD makes sense for consumers:  gamers, video processing, and such.  I'm truly puzzled.  The parts aren't big enough in capacity to store anything like the massive files that a file based coder would use.  For the prosumer world that the X25 parts targeted, the 510 just won't be useful, it's barely on par with the X25-G2. 

We'll see.  The 510 still has an advantage over the SandForce drives for compressed/encrypted data, but that's usually things like my beloved relational databases and random processing, not the 510's strength.  Weird.

04 March 2011

32 Heads Are Better Than One

Simple-talk is one of my favorite sites, and now they have what will be a series on parallelism in SQL Server.  This first installment is light and airy.

A few months ago, I got into a bit of a tiff on a Postgres email group when I had the temerity to suggest that query level parallelism is not only a Good Thing, but the only way to maintain performance as we segue from ever faster clocks on single thread/core cpu's to multi-thread/core/processor machines.  That group assembled (I don't recall anyone joining in my defense) asserted that engine level parallelism (doling out queries to threads) was enough.
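To make the distinction concrete, here's a toy sketch of the shape of query level parallelism: one aggregate is split across partitions, each worker scans its own partition, and the partial results are combined. The rows, the "east" filter, and the worker count are all made up, and Python threads won't buy real CPU parallelism here; this only illustrates the split/scan/combine structure, not how any particular engine implements it.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Scan one partition: the equivalent of SUM(amount) WHERE region = 'east'."""
    return sum(amount for region, amount in chunk if region == "east")

def parallel_sum(rows, workers=4):
    """Split a single query's scan across workers, then combine the partials."""
    size = max(1, math.ceil(len(rows) / workers))
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

# Hypothetical table: 1,000 rows alternating between two regions.
rows = [("east" if i % 2 == 0 else "west", i) for i in range(1000)]

serial_total = partial_sum(rows)
parallel_total = parallel_sum(rows)
print(serial_total, parallel_total)
```

Engine level parallelism, by contrast, would hand each whole query to one thread, so a single big query would still run at the speed of one core.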

I've been arguing for years that RDBMSs (not just SQL Server) will be better applications if they're designed to the multi-core/processor/SSD machine.  After all, at least the multi-core part is now fait accompli, so why not?  The beneficial side-effect is that BCNF schemas, with SSD as *primary* storage, are fully feasible.  They are the minimal data (bytes, that is) needed to fulfill demand, and since they are "fully" normalized, DRI implements most if not all of the constraints on the data.  That's been written about here, a bit.

For all the heat that MicroSoft gets, even from me on occasion, they do get databases.  Good on them.

For completeness, here's DB2 docs, rather dry, but then...

And here's Oracle.

Neither is as seamless, at first blush, as SQL Server.  There, I said something nice about a MicroSoft product.