Get fields by name in Pig?

Is there any way I can get the value of a field by its name? For instance, if I need to get 10 fields, is there no other way than to call input.get(i) for i from 0 to 9? I am new to Pig, so I am interested in knowing why this is the case. Is there something like tuple.getByFieldName('Field Name')?

asked Jun 7, 2013 at 21:42

3 Answers

This is not possible, nor would it be very good design to allow it. Pig field names are like variable names. They allow you to give a memorable name to something that gives you insight into what it means. If you use those names in your UDF, you are forcing every Pig script which uses the UDF to adhere to the same naming scheme. If you decide later that you want to think of your variables a little differently, you can't reflect that in their names because the UDF would not function anymore.

The code that reads data from the input tuple in your UDF is like a function declaration. It establishes how to treat each argument to the function.

If you really want to be able to do this, you can build a map easily enough using the TOMAP builtin function, and have your UDF read from the map. This greatly hurts the reusability of your UDF for the reasons mentioned above, but it is nevertheless a fairly simple workaround.
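A minimal sketch of the UDF side of this workaround, written in plain Java so it runs without Pig on the classpath. It assumes the script passes one map argument, for example MyUdf(TOMAP('name', name, 'age', age)), where MyUdf and the keys name and age are hypothetical. In a real UDF the map would come from casting input.get(0); here a HashMap stands in for it.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the body of a UDF that receives a single map argument.
// Assumption: in a real Pig UDF the map would be read with (Map<String, Object>) input.get(0).
class MapArgumentSketch {

    // Look up a required entry by key, failing loudly if the script did not supply it.
    static Object require(Map<String, Object> fields, String name) {
        if (!fields.containsKey(name)) {
            throw new IllegalArgumentException("missing field: " + name);
        }
        return fields.get(name);
    }

    // Hypothetical UDF logic: combine two entries looked up by key.
    static String describe(Map<String, Object> fields) {
        return require(fields, "name") + " (" + require(fields, "age") + ")";
    }

    public static void main(String[] args) {
        // Stand-in for what TOMAP('name', name, 'age', age) would hand to the UDF.
        Map<String, Object> fields = new HashMap<>();
        fields.put("name", "alice");
        fields.put("age", 30);
        System.out.println(describe(fields));
    }
}
```

Note that the keys are now part of the UDF's contract: every script calling it has to build the map with exactly these keys, which is the reusability cost described above.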

answered Jun 9, 2013 at 2:11

While I agree that the flexibility of the function suffers if you use field names, it is technically possible to access fields by name.

The trick is to use the input schema, available through getInputSchema(), and take the mapping between field indexes and names from there. You can also override outputSchema and build the mapping there from its input schema parameter. Your exec method can then use this mapping to translate a field name into a position in the input tuple.
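The name-to-index mapping can be sketched roughly as follows. This is plain Java so that it runs without Pig: a list of aliases stands in for the field names you would read from the schema, and a list of values stands in for the input tuple. The class name FieldNameLookupSketch and the method getByFieldName are hypothetical, not part of Pig's API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a name -> position lookup built from a schema.
// Assumption: in a real UDF the aliases would be read from the schema returned by
// getInputSchema(), and the values would be read with input.get(position).
class FieldNameLookupSketch {
    private final Map<String, Integer> positionByName = new HashMap<>();

    // Build the mapping once, e.g. when the schema first becomes available.
    FieldNameLookupSketch(List<String> aliases) {
        for (int i = 0; i < aliases.size(); i++) {
            String alias = aliases.get(i);
            if (alias != null) {          // skip fields that have no alias
                positionByName.put(alias, i);
            }
        }
    }

    // Translate a field name into a position, then read that position from the tuple.
    Object getByFieldName(List<Object> tuple, String name) {
        Integer position = positionByName.get(name);
        if (position == null) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return tuple.get(position);
    }

    public static void main(String[] args) {
        FieldNameLookupSketch lookup =
                new FieldNameLookupSketch(Arrays.asList("id", "name", "age"));
        List<Object> tuple = Arrays.<Object>asList(7, "alice", 30);
        System.out.println(lookup.getByFieldName(tuple, "name"));
    }
}
```

Building the map once rather than on every call keeps the per-tuple cost to a single hash lookup. The UDF still only works with scripts that use the expected aliases.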

answered Sep 29, 2016 at 18:51

I don't think you can access a field by name. You would need a map-like structure to do that. In Pig, even though you cannot look a field up by name, you can still rely on its position as long as the schema of the input (the load) is properly defined and consistent.

The most you can do is validate the types of the fields your UDF receives.

On the other hand, you can implement outputSchema in your UDF to publish its output by name. See the UDF Manual.

answered Jun 8, 2013 at 2:21

