Say you have a string representing a Unix timestamp stored in scientific notation (e.g. '8.534592E8'), and you want to extract the year from that timestamp. If you are using Hive 0.11, you can use the new decimal data type to handle the scientific notation and quickly extract the date you need:
select year(from_unixtime(cast(cast('8.534592E8' as decimal) as bigint)))
If, like me, you're running Hive on Amazon Elastic MapReduce, which only supports Hive versions 0.7.1 and 0.8.1, you'll have to do it manually. This is how I did it:
1. Find the string before the period.
2. Find the string between the period and the 'E'. We'll call this string X.
3. Find the int after the 'E'. We'll call this int Y.
4. Concatenate the strings from steps 1 and 2, followed by (Y - len(X)) '0' characters.
5. Cast the string from step 4 to a bigint; this is your unix time in standard notation.
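The steps above can be sketched outside Hive, e.g. in Python (a minimal sketch; the function name `sci_to_unixtime` is mine, not part of the query):

```python
import re

def sci_to_unixtime(s):
    """Convert a scientific-notation string like '8.534592E8'
    to a standard-notation integer, following the steps above."""
    m = re.match(r'(.*)\.(.*)E(.*)', s)
    before = m.group(1)           # step 1: string before the period
    x = m.group(2)                # step 2: string between the period and the 'E'
    y = int(m.group(3))           # step 3: int after the 'E'
    padded = before + x + '0' * (y - len(x))  # step 4: concatenate and right-pad
    return int(padded)            # step 5: cast to an integer

print(sci_to_unixtime('8.534592E8'))  # 853459200
```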
select year(
  from_unixtime(
    cast(
      concat(
        -- Before the period
        regexp_extract('8.534592E8', '(.*)(\.)(.*)(E)(.*)', 1),
        -- Between the period and the E
        regexp_extract('8.534592E8', '(.*)(\.)(.*)(E)(.*)', 3),
        repeat(
          '0',
          -- Right-pad with this many 0's
          cast(regexp_extract('8.534592E8', '(.*)(\.)(.*)(E)(.*)', 5) as int)
            - length(regexp_extract('8.534592E8', '(.*)(\.)(.*)(E)(.*)', 3))
        )
      ) as bigint
    )
  )
)
In my case, the strings were inside a JSON-encoded object, and not all of them were in scientific notation. This version handles those cases with some additional logic that checks for the presence of the character 'E' in the string. I attempted to use a LATERAL VIEW to decode the JSON object only once, but kept getting a cryptic ArrayIndexOutOfBoundsException (details below; let me know if you have any clues on this).
select bdayyear, count(1)
from (
  select year(
    from_unixtime(
      cast(
        if(instr(get_json_object(userdata.json, '$.birthdate'), 'E') > 0,
          -- Convert from scientific notation
          concat(
            -- Before the period
            regexp_extract(get_json_object(userdata.json, '$.birthdate'), '(.*)(\.)(.*)(E)(.*)', 1),
            -- Between the period and the E
            regexp_extract(get_json_object(userdata.json, '$.birthdate'), '(.*)(\.)(.*)(E)(.*)', 3),
            repeat(
              '0',
              -- Right-pad with this many 0's
              cast(regexp_extract(get_json_object(userdata.json, '$.birthdate'), '(.*)(\.)(.*)(E)(.*)', 5) as int)
                - length(regexp_extract(get_json_object(userdata.json, '$.birthdate'), '(.*)(\.)(.*)(E)(.*)', 3))
            )
          ),
          -- Not in scientific notation
          get_json_object(userdata.json, '$.birthdate')
        ) as bigint
      )
    )
  ) as bdayyear
  from userdata
) t
group by bdayyear;
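The same branching logic, sketched outside Hive in Python (an illustration only; the key name '$.birthdate' comes from the query above, while `birth_year` is a name I made up):

```python
import json
import re
from datetime import datetime, timezone

def birth_year(json_str):
    """Extract the year from a JSON-encoded birthdate that may or may
    not be a unix timestamp in scientific notation."""
    s = json.loads(json_str)['birthdate']
    if 'E' in s:  # mirrors instr(..., 'E') > 0 in the Hive query
        m = re.match(r'(.*)\.(.*)E(.*)', s)
        s = m.group(1) + m.group(2) + '0' * (int(m.group(3)) - len(m.group(2)))
    return datetime.fromtimestamp(int(s), tz=timezone.utc).year

print(birth_year('{"birthdate": "8.534592E8"}'))  # 1997
print(birth_year('{"birthdate": "853459200"}'))   # 1997
```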
Now, the next step is fixing this at write-time so we don’t have to deal with this madness anymore.
[1]:
FAILED: Hive Internal Error: java.lang.ArrayIndexOutOfBoundsException(-1)
java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.get(ArrayList.java:324)
	at org.apache.hadoop.hive.ql.optimizer.ColumnPrunerProcFactory$ColumnPrunerSelectProc.process(ColumnPrunerProcFactory.java:397)
	at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
	at org.apache.hadoop.hive.ql.optimizer.ColumnPruner$ColumnPrunerWalker.walk(ColumnPruner.java:143)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
	at org.apache.hadoop.hive.ql.optimizer.ColumnPruner.transform(ColumnPruner.java:106)
	at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:87)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7339)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:261)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:218)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:409)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:567)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:187)