24 December 2011

Head In The Cloud (or Somewhere)

Came across this posting whilst surfing, with the following:

"For example, one large package deliverer sees a 400% increase in its network traffic, database needs, and computing power over a single 45 day window. Utilizing a service such as Azure will allow them to pay for that excess capacity only in the 45 days it is needed, not the other 320 days a year when it is not utilized, but must be maintained, upgraded, and licensed."

Why do folks persist in believing that there is a Free Lunch up in The Cloud? The notion that seasonal/repetitive demand spikes for IT resources are uniformly distributed over a time period (day/week/year) is just silly. Yes, as a loss leader, a Cloud vendor may choose not to add a Load Factor Penalty. In the beginning. But that's not sustainable, precisely because demand spikes aren't uniformly distributed. Cloud vendors will make the clients pay for all that idle storage and cpu (and a nice profit on the idle riches), you betcha. That package deliverer experienced the same spike as all of the retail chains. And all of the energy vendors. And so on. It's Econ. 101, folks.
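A back-of-the-envelope sketch in R makes the point; the numbers are mine, lifted loosely from the quote (a 4X spike over a 45 day window), and assume every tenant spikes in the same window, so the vendor must provision for the peak:

baseline  <- 1                  # capacity units per tenant, off-peak
spike     <- 4 * baseline       # peak demand, everyone at once
peak.days <- 45
year.days <- 365
# average utilization of the peak capacity the vendor must keep on hand
(spike * peak.days + baseline * (year.days - peak.days)) / (spike * year.days)
# [1] 0.3424658

Roughly two-thirds of that provisioned capacity sits idle over the year. Somebody pays for it: either the vendor eats it as a loss leader, or the price carries it.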

21 December 2011

Icarus

The sky is falling. Oracle reported down a bit but, in particular, didn't report above expectations. Larry has long been sly about setting guidance low enough that bettering it is a piece of cake. Not this time. As I type, the stock is down 15%, and the news is that much of the tech sector is getting the flu.

The knee jerk reaction: mortgage the farm and buy Oracle stock. IIIIIIIIIIII'm not so sure this time. Here's why.

Oracle didn't get quite the *new* software sales and *new* hardware sales it was expected to. The latter is, by all accounts, due to customers waiting on the new machines during the quarter. The former is more speculative. The reports are vague. My take: given the aggressive pricing of the Oracle RDBMS, ditto for MySql (yes, it's GPL, but Oracle blasted its support prices into the sky in the past year), and the screws being put to java adopters, folks are looking for a safer port.

On the RDBMS side, Postgres gets ever closer to Oracle, if you're not a Fortune X00 company (and even if you are, and building apps off the mission critical axis). Mainstream pundits are crying that "Da Cloud, boss, Da Cloud" is putting Oracle in an untenable position. It is said that cloud providers use dirt cheap components, soft and hard, and that Oracle's RDBMS and Sun-ish machines are just too expensive Up There. As if being cheap were the best way to make money!? "Cheap goods sold dear" is an aphorism that's been around forever. The Cloud is shaping up that way, and if so, I'd avoid Fortune X00 companies that choose to put *my data* Up There. In this nascent era of The Cloud, there are too many stories of wandering data to suit me.

What's really stupid about Larry's ploy: the Oracle RDBMS is built on an engine (the piece that actually does all the inserting and updating) that uses Multiversion Concurrency Control (MVCC, as it is known), which is better suited to the asynchronous nature of the Web/Cloud than the locking paradigm that most other RDBMS (notably, not Postgres) have been using for decades. They've been backing in MVCC support recently, so to speak, but none is a true MVCC database. In other words, Larry has the proper mousetrap for the setting, but has managed to offend his customers. But that's Larry's way.
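For those who haven't lived with both, here's a minimal sketch of what MVCC buys you, written in R with the RPostgreSQL package against a vanilla Postgres instance (the database name and the accounts table are mine, purely illustrative): a reader gets a consistent snapshot and is never blocked by an in-flight writer, which is exactly what a swarm of asynchronous web clients wants.

library(RPostgreSQL)                        # DBI driver for Postgres

drv    <- dbDriver("PostgreSQL")
writer <- dbConnect(drv, dbname = "test")   # hypothetical database
reader <- dbConnect(drv, dbname = "test")

# the writer opens a transaction and changes a row, but does not commit
dbGetQuery(writer, "BEGIN")
dbGetQuery(writer, "UPDATE accounts SET balance = balance - 100 WHERE id = 1")

# under MVCC this SELECT returns the pre-update value immediately;
# a lock-based engine would make it wait (or hand back dirty data)
dbGetQuery(reader, "SELECT balance FROM accounts WHERE id = 1")

dbGetQuery(writer, "COMMIT")
dbDisconnect(writer); dbDisconnect(reader)

A lock-based engine, by contrast, parks the reader behind the writer's row lock until the commit, which is death by a thousand cuts when the clients are browsers scattered across the planet.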

Reports say that Oracle claims the shortfall is due to last minute non-signings. If so, then this is an aberrant glitch. Given that Fortune X00 companies are sitting on, by some accounts, more than $3 trillion, there's no macro reason to not buy new IT. Unless you're a Fat Man yearning for Famine.

20 December 2011

A Warren-ted Search

One of the points "for further research", as I used to say when I was an academic, in the Triage exercise was using social media to measure outcomes. R has a library, twitteR (yes, R folks tend to capitalize the letter at every opportunity), which retrieves some data. I was at first uninterested, since I don't have a twitter account. Thankfully, twits can be gotten without being a twitterer. Since Elizabeth Warren's campaign is just over the border, and sort of important in the grand scheme of things, I've been exploring.

Here's the entirety of the R code (as seen in an RStudio session) needed to return the twits (1,500 is the max, which will prove troublesome when the battle is fully engaged):

> library(twitteR)
> warrenTweets <- searchTwitter('@elizabethwarren', n = 1500)
> length(warrenTweets)
[1] 9
> library(plyr)   # laply() comes from plyr, which twitteR doesn't load for you
> warren.Text <- laply(warrenTweets, function(t) t$getText())
> head(warren.Text, 10)
[1] "@elizabethwarren i hope you win agianst sen scott brown. the 99% r with u"
[2] "@elizabethwarren More $$$ coming your way!"
[3] "#HR3505 PAGING: @ElizabethWarren Help us!!!!"
[4] "@elizabethwarren - not to worry, the only job Karl Rove ever got somebody was George W. Bush. and look how that turned out."
[5] "RT @SenatorBuono: What an amazing turnout 4 a superstar. @elizabethwarren"
[6] "HELLO @ElizabethWarren ! PLEASE RUN as a 3rd party or Ind. FOR POTUS2012. Dems just threwSENIORS underthebus for the working tax cut! EXdem"
[7] "@chucktodd We hope 2011 will be remembered for something a LOT closer to home. #ows #OccupyWallStreet @ElizabethWarren #WARREN/PELOSI-2016"
[8] "RT @SenatorBuono: What an amazing turnout 4 a superstar. @elizabethwarren"
[9] "What an amazing turnout 4 a superstar. @elizabethwarren"


The lines starting with > are the R code; the lines starting with [x] are the output. Here we have 9 twits.

Now, what do we do with the text? For that, I'll send you off to this presentation, which came up in my R/twitter search (and is the source of what you've seen here), given in Boston. Missed it, dang. Slide 11 begins the explanation of how one might parse the twits looking for positive/negative response. By the way, even if you're not the least bit interested in such nonsense, visit slide 29.
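For the impatient, the gist of that parsing is a plain word-count score. Here's a minimal sketch, with toy word lists of my own (a real run would use a published opinion lexicon), that tallies positive minus negative words per twit:

# toy word lists; swap in a real opinion lexicon for serious work
pos.words <- c("win", "hope", "amazing", "superstar")
neg.words <- c("worry", "fail")

score.sentiment <- function(text, pos, neg) {
  # strip punctuation, lower-case, split on whitespace, then count matches
  words <- strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+")
  sapply(words, function(w) sum(w %in% pos) - sum(w %in% neg))
}

score.sentiment(warren.Text, pos.words, neg.words)

A score above zero leans positive, below zero negative; all the real work is in the word lists and the text cleaning, not in the arithmetic.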

As I mentioned in Triage and follow-ups, getting the outcomes data is the largest piece of the work. Simply being able to "guarantee" the accuracy of twitter (or any other uncontrolled source) data, given the restriction on returned twits and such, will require some level of data sophistication, which your average Apparatchik likely doesn't care about. The goal, I'll mention again, isn't to emulate Chris Farley's Matt Foley and pump up a candidate no matter what the data say, but to find the candidate, out of many, most likely to win given some help. Whether Triage would be useful to a single candidate, well, that depends on the inner strength of the candidate.

19 December 2011

Ya Can't Get Theah From Heah

I'm working my way through Wilkinson's graphics book, and right there, on page 404 (could there possibly be a better page for this?) is this:
"The price paid for this efficiency is the loss of flexibility. We cannot examine relations not represented in the hierarchies. Navigation through OLAPs is quite difficult, which is why so many graphical models for representing them have been proposed."

Now, this is a graphics coding guy; he created SYSTAT. He's not, so far as I know, an RM guy. But he gets it.

16 December 2011

Lies, Damn Lies, and Statistics

The Other-R folks have posted a recent entry which references an EMC paper (here if you follow the breadcrumbs) on the state of Data Analysis and Business Intelligence, from the point of view of practitioners. The blog post makes some useful points, but misses some.

I'm referring to the graph in the original post, which is on page 3 (on my screen) of the EMC paper.

What this graph tells me, mostly, is that BI folks are still tied to MS, Excel in particular. Data analysts, not so much; although they'll be tied to corporate policy in such venues.

A few words about each.

Data Storage: SQL Server is tops, which means that most folks, in both camps, are tied to corporate group level machines, not the Big Iron. It's been that way for decades; the analysts have to extract from the Big Iron and crunch on their own PCs. The categories Other SQL, Netezza, and Greenplum leave room for the Triage with PL/R approach, since the latter two are explicitly Postgres-derived and Other SQL is likely as much Postgres as MySql (yuck!). The category is, possibly, misleading if one jumps to the conclusion that companies are MS-centric with their data.

Data Management: No real surprise here. Excel is the tool of choice. Way back when I was teaching PC software courses, 1-2-3 was the spreadsheet of choice and all data went through it, and Excel inherited the mindset that a spreadsheet is sophisticated analysis. It is a bit unnerving to realize that so much of what corporations decide is supported by such dreck. Note: the BI folks, who in the past were executive assistants and "secretaries", still use spreadsheets a lot. The Data folks, the other way round. There is small comfort in that. The presence of BASH (or Korn or ...) and AWK (Python and Perl too, but not quite so much; each has bespoke language I/O in the mix) is interesting, in that it means that a fair amount of data lives in clear-text ASCII files. Think about that for a second.

Data Analysis: Clearly, the Data folks use stat packs while the BI folks mostly don't. SAS and SPSS and Stata leading says that the EMC client base is largely large corporate, which isn't a surprise. What is a surprise is the absence of Excel. On the other hand, the original paper has this (right beside the graph): "While most BI professionals do their *analysis* and data processing in Excel, data science professionals are using SQL, advanced statistical packages...", which corresponds to my experience (emphasis mine).

Data Visualization: The absence of R is suspect, as any R user would understand.

And, finally, this has nothing to do with Big Data, in any case. BD is just another money-spinning attempt by those with an agenda. Janert, in his book "Data Analysis...", makes clear that BD isn't worth the trouble (my inference). The point being that population data, which is what BD offers, yields just descriptive stats, and smart data folks aren't interested in descriptive stats. Sports fans, well yeah.

13 December 2011

Jumpin Jack Flash [UPDATE]

In today's posting of the Momentus XT review, Anand (nearly) ends it with this:
"Longer term it's unclear to me whether hybrid drives like the Momentus XT will fill the gap left by SSDs or if software based caching technologies combined with NAND on motherboards will be the preferred route."

Oddly, I wondered much the same a few years back when Sun (pre-Oracle) announced its flash "appliance". That, of course, was for near-mainframe level servers, but it was an early assault on the flash-as-HDD approach to using flash. There have been discussions, particularly in the context of PCIe "drives" (Fusion-io, of course), about whether there need be this SSD-as-HDD approach at all. More folks are talking about storage now as just flash, pretty much, directly wired to the cpu and/or memory manager. As Linus said all those years ago, it will change how file systems are built, if we bother with them at all.

[UPDATE]
Not to mention, this tickled a lower brain stem memory: WinFS. Go spend some time with the Wikipedia article. Mayhaps SSD will be the crutch it needed?

11 December 2011

Parallax View

"The Parallax View" was a fun movie, especially for those with a conspiracy slant on the world. One might even see it as harbinger of "Robocop". Well, it turns out that there is at least one other take on scatterplot matrix. In the context of the authors' data, it makes some sense.

I tried it with the Triage database table, but am not so convinced it helps for the discrete data proposed. Have a look, and see what you think:

CREATE OR REPLACE FUNCTION "public"."pairs_graph" () RETURNS text AS
$BODY$
library(ggplot2)
library(GGally)                        # ggpairs() lives here
# a graphics device for the backend; assumes an X server (e.g. Xvfb) at :5
X11(display = ':5')
# pull one candidate's rows from the Triage events table
events <- pg.spi.exec("select period, event, amount, choice, guncontrol, outcome
                       from public.events where candidate = 'Doe'")
png('ggpairs_graph.png')               # written to the server's working directory
events$gcfactor <- as.factor(events$guncontrol)
p <- ggpairs(events, columns = c("period", "amount", "choice", "gcfactor"),
             diag = list(continuous = "density", discrete = "bar"),
             axisLabels = "show")
print(p)
dev.off()
return('done')
$BODY$
LANGUAGE 'plr';
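A usage note, for what it's worth: the function is invoked from psql with a plain SELECT pairs_graph(); and since the png() call gives no path, ggpairs_graph.png lands in the Postgres backend's current working directory (typically the data directory), from which you fetch it by hand.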
And here's the graph:

06 December 2011

Smaller, But Not Worse

Just when the future looks most bleak comes this news, today: IMFT announces that 20nm flash is in production, and that (look at the pretty picture) 128GB will soon be at your fingertips. The industry may just turn around and do the faster tape waltz. On the other hand, with erase cycles *not* deteriorating, 5NF for large systems is in reach. Perhaps, just maybe, The Yellow Brick Road isn't the Road to Perdition (Jude was creepy).